Re: strlen

public inbox for gcc-help@gcc.gnu.org
 help / color / mirror / Atom feed

* Re: strlen
       [not found]                     ` <5566b180-1333-d73b-22ee-6c6d32053921@jguk.org>
@ 2021-07-08 11:06                       ` Alejandro Colomar (man-pages)
  2021-07-08 12:05                         ` strlen johnsfine
                                           ` (3 more replies)
  0 siblings, 4 replies; 15+ messages in thread
From: Alejandro Colomar (man-pages) @ 2021-07-08 11:06 UTC (permalink / raw)
  To: Jonny Grant; +Cc: linux-man, Michael Kerrisk, Florian Weimer, gcc-help

On 7/8/21 12:07 PM, Jonny Grant wrote:
> Thank you for your reply.
> 
> We can't guarantee safestrlen() won't be called with NULL. So because strlen() itself doesn't check for NULL in C standard we'd need to call the wrapper so that NULL can be checked for.
> 
> I'd like to avoid the compiler removing certain execution paths.
> I'd rather keep all code paths, even if they are not taken, just in case a NULL pointer creeps in due to an external device that is connected to an embedded system.
> 
> 
> Probably this would work:
> 
> size_t __attribute__((optimize("O0"))) safestrlen(const char * s)
> {
>      if (NULL == s) return 0;
>      else return strlen(s);
> }

I don't think you don't need that.  Unless there's a bug in GCC, it 
shouldn't optimize that path unless it is 100% sure that it will never 
be called.

Moreover, I recommend you to optimize as much as possible.
Even though NULL is possible in your code, I guess it's unlikely.

Also, calling a function safe is too generic.
I'd call it with the suffix null, as it act different on null.

Also, I recommend avoiding 'size_t' (and any other unsigned types, BTW).
See <https://google.github.io/styleguide/cppguide.html#Integer_Types>.
Use the POSIX type 'ssize_t'.
That also allows differentiating a length of 0 (i.e., "") from an 
invalid string (i.e., NULL), by returning -1 for NULL.

Recommended implementation (requires C99 or later due to the usage of 
inline):

[
// compiler.h

#define likely(x)    __builtin_expect((x), 1)
#define unlikely(x)  __builtin_expect((x), 0)
]

[
// strlennull.h

#include <string.h>
#include <sys/types.h>

#include "compiler.h"

[[gnu::pure]]
inline ssize_t strlennull(const char *s);

inline ssize_t strlennull(const char *s)
{
	if (unlikely(!s))
		return -1;
	return strlen(s);
}
]

[
// strlennull.c

#include "strlennull.h"

#include <sys/types.h>

extern inline ssize_t strlennull(const char *s);
]

BTW, if the input is so untrusted that NULL may be possible, I guess it 
might as well contain invalid strings (non-NUL-terminated), for which 
strlen(3) would behave as bad or worse.  So I recommend having a look at 
strnlen(3) and maybe implement strnlennull() instead of strlennull().

Unless you _know_ there's a compiler bug that doesn't allow you to 
optimize, please optimize.  Otherwise, if it's just a bit of paranoia, 
you could as well not optimize any code at all.  Specifically not 
optimizing this code by the use of pragmas or attributes is *wrong*. 
Just revise the preprocessor output, the compiler output, and also try 
introducing a NULL at run time, to check that everything's fine.

> 
> I also use 'volatile' for reads/writes to addresses that I don't want to be optimized out.

Are you sure you don't want to optimize?  Are those addresses hardware 
I/O or interrupts?

Maybe you just want membarrier(2).

> 
>>
>> What I recommend you to do from time to time, to make sure you don't miss any warnings, is compile the whole project first with '-O3' and then with '-O0'.  If you are a bit paranoic, sporadically you can try all of them : '-Og', '-O0', '-O1', '-Os', '-O2', '-O3', '-Ofast' but I don't think that is necessary.  Of course, always use '-fanalyzer' (GCC 10 and above).
> 
> Yes, I am looking forward to David Malcom's -fanalyzer when Ubuntu LTS next upgrades, I'm on gcc 9.3 today. But -fanalyzer is only for C anyway.. much of of code base I work with is compiled as C++ so I can't use -fanalyzer yet.

You can install gcc-10 for Bionic: apt-get install gcc-10
I recommend it.  It finds bugs very deep in the code.

Cheers,

Alex

-- 
Alejandro Colomar
Linux man-pages comaintainer; https://www.kernel.org/doc/man-pages/
http://www.alejandro-colomar.es/

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: strlen
  2021-07-08 11:06                       ` strlen Alejandro Colomar (man-pages)
@ 2021-07-08 12:05                         ` johnsfine
  2021-07-09  2:08                           ` strlen LIU Hao
  2021-07-08 12:13                         ` strlen Xi Ruoyao
                                           ` (2 subsequent siblings)
  3 siblings, 1 reply; 15+ messages in thread
From: johnsfine @ 2021-07-08 12:05 UTC (permalink / raw)
  To: gcc-help, jg; +Cc: gcc-help, linux-man, fw, mtk.manpages

 This is not the forum for such a discussion.  But I want to make people reading this aware that many expert C and C++ programmers (likely a majority) consider that advice to avoid unsigned types to be horrible advice.
I advise people to avoid signed types and I do so myself.  If an integer value won't be negative, it shouldn't be signed.  That makes your intent clearer to anyone reading your code, and (especially in x86-64) lets the compiler generate smaller and faster code.
If you aren't an expert, I strongly suggest letting the library functions guide you in choice of data type.  Where it takes a size_t, your variables that exist primarily to be passed that way should be size_t.  Where it returns std::size_t, your variables that exist primarily to take that return should be size_t and especially, typical functions you write to parallel those functions should have the same return type.  This is a rule I usually break myself, because I write a lot of code in which speed is critical and in which various cache issues dominate speed (rather than actual computations dominating speed), because I know when values will reliably be less than 2 to 32, and because there are many usages in which 32 bit unsigned is the most efficient x86-64 data type (more efficient that size_t and much more efficient than int).  But for most programmers, the safety and clarity and portability of using size_t (rather than 32 bit unsigned) in such situations outweighs the performance difference.

Obviously, others disagree.  I hope I'm not starting a big discussion here (I don't mind a big discussion somewhere, but not here).  I just wanted to balance the one sided opinion (that I have seen too many times recently).

-----Original Message-----
From: Alejandro Colomar (man-pages) via Gcc-help <gcc-help@gcc.gnu.org>
To: Jonny Grant <jg@jguk.org>
Cc: gcc-help@gcc.gnu.org; linux-man <linux-man@vger.kernel.org>; Florian Weimer <fw@deneb.enyo.de>; Michael Kerrisk <mtk.manpages@gmail.com>
Sent: Thu, Jul 8, 2021 7:06 am
Subject: Re: strlen

On 7/8/21 12:07 PM, Jonny Grant wrote:
> Thank you for your reply.
> 
> We can't guarantee safestrlen() won't be called with NULL. So because strlen() itself doesn't check for NULL in C standard we'd need to call the wrapper so that NULL can be checked for.
> 
> I'd like to avoid the compiler removing certain execution paths.
> I'd rather keep all code paths, even if they are not taken, just in case a NULL pointer creeps in due to an external device that is connected to an embedded system.
> 
> 
> Probably this would work:
> 
> size_t __attribute__((optimize("O0"))) safestrlen(const char * s)
> {
>      if (NULL == s) return 0;
>      else return strlen(s);
> }

I don't think you don't need that.  Unless there's a bug in GCC, it 
shouldn't optimize that path unless it is 100% sure that it will never 
be called.

Moreover, I recommend you to optimize as much as possible.
Even though NULL is possible in your code, I guess it's unlikely.

Also, calling a function safe is too generic.
I'd call it with the suffix null, as it act different on null.

Also, I recommend avoiding 'size_t' (and any other unsigned types, BTW).
See <https://google.github.io/styleguide/cppguide.html#Integer_Types>.
Use the POSIX type 'ssize_t'.
That also allows differentiating a length of 0 (i.e., "") from an 
invalid string (i.e., NULL), by returning -1 for NULL.

Recommended implementation (requires C99 or later due to the usage of 
inline):

[
// compiler.h

#define likely(x)    __builtin_expect((x), 1)
#define unlikely(x)  __builtin_expect((x), 0)
]

[
// strlennull.h

#include <string.h>
#include <sys/types.h>

#include "compiler.h"

[[gnu::pure]]
inline ssize_t strlennull(const char *s);

inline ssize_t strlennull(const char *s)
{
    if (unlikely(!s))
        return -1;
    return strlen(s);
}
]

[
// strlennull.c

#include "strlennull.h"

#include <sys/types.h>

extern inline ssize_t strlennull(const char *s);
]

BTW, if the input is so untrusted that NULL may be possible, I guess it 
might as well contain invalid strings (non-NUL-terminated), for which 
strlen(3) would behave as bad or worse.  So I recommend having a look at 
strnlen(3) and maybe implement strnlennull() instead of strlennull().

Unless you _know_ there's a compiler bug that doesn't allow you to 
optimize, please optimize.  Otherwise, if it's just a bit of paranoia, 
you could as well not optimize any code at all.  Specifically not 
optimizing this code by the use of pragmas or attributes is *wrong*. 
Just revise the preprocessor output, the compiler output, and also try 
introducing a NULL at run time, to check that everything's fine.

> 
> I also use 'volatile' for reads/writes to addresses that I don't want to be optimized out.

Are you sure you don't want to optimize?  Are those addresses hardware 
I/O or interrupts?

Maybe you just want membarrier(2).

> 
>>
>> What I recommend you to do from time to time, to make sure you don't miss any warnings, is compile the whole project first with '-O3' and then with '-O0'.  If you are a bit paranoic, sporadically you can try all of them : '-Og', '-O0', '-O1', '-Os', '-O2', '-O3', '-Ofast' but I don't think that is necessary.  Of course, always use '-fanalyzer' (GCC 10 and above).
> 
> Yes, I am looking forward to David Malcom's -fanalyzer when Ubuntu LTS next upgrades, I'm on gcc 9.3 today. But -fanalyzer is only for C anyway.. much of of code base I work with is compiled as C++ so I can't use -fanalyzer yet.

You can install gcc-10 for Bionic: apt-get install gcc-10
I recommend it.  It finds bugs very deep in the code.

Cheers,

Alex

-- 
Alejandro Colomar
Linux man-pages comaintainer; https://www.kernel.org/doc/man-pages/
http://www.alejandro-colomar.es/

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: strlen
  2021-07-08 12:05                         ` strlen johnsfine
@ 2021-07-09  2:08                           ` LIU Hao
  2021-07-09 17:26                             ` strlen Martin Sebor
  0 siblings, 1 reply; 15+ messages in thread
From: LIU Hao @ 2021-07-09  2:08 UTC (permalink / raw)
  To: johnsfine, gcc-help, jg; +Cc: linux-man, fw, mtk.manpages

[-- Attachment #1.1: Type: text/plain, Size: 1256 bytes --]

在 7/8/21 8:05 PM, johnsfine--- via Gcc-help 写道:
>   This is not the forum for such a discussion.  But I want to make people reading this aware that many expert C and C++ programmers (likely a majority) consider that advice to avoid unsigned types to be horrible advice.
> I advise people to avoid signed types and I do so myself.  If an integer value won't be negative, it shouldn't be signed.  That makes your intent clearer to anyone reading your code, and (especially in x86-64) lets the compiler generate smaller and faster code.

That makes no sense. Would you prefer unsigned integers to signed ones, for something that can only 
be one of {1,2,3,4}? Just because something can't be negative, does not mean it should be unsigned.

Conversion between `uint64_t` and `double` is much slower than `int64_t` on some platforms, e.g. 
x86, so 'smaller and faster code' is also groundless [1].

A signed integer is a value that denotes the number or amount of something, such as bytes, 
characters, files, seconds, kilometers, or even dollars. An unsigned integer is merely a fixed-sized 
collection of bits. That's the only fundamental difference.

[1] https://gcc.godbolt.org/z/6YY61rhGe

-- 
Best regards,
LIU Hao

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 840 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: strlen
  2021-07-09  2:08                           ` strlen LIU Hao
@ 2021-07-09 17:26                             ` Martin Sebor
  2021-07-09 19:51                               ` strlen johnsfine
  2021-07-09 20:19                               ` strlen Alejandro Colomar (man-pages)
  0 siblings, 2 replies; 15+ messages in thread
From: Martin Sebor @ 2021-07-09 17:26 UTC (permalink / raw)
  To: LIU Hao, johnsfine, gcc-help, jg; +Cc: linux-man, fw, mtk.manpages

On 7/8/21 8:08 PM, LIU Hao via Gcc-help wrote:
> 在 7/8/21 8:05 PM, johnsfine--- via Gcc-help 写道:
>>   This is not the forum for such a discussion.  But I want to make 
>> people reading this aware that many expert C and C++ programmers 
>> (likely a majority) consider that advice to avoid unsigned types to be 
>> horrible advice.
>> I advise people to avoid signed types and I do so myself.  If an 
>> integer value won't be negative, it shouldn't be signed.  That makes 
>> your intent clearer to anyone reading your code, and (especially in 
>> x86-64) lets the compiler generate smaller and faster code.
> 
> That makes no sense. Would you prefer unsigned integers to signed ones, 
> for something that can only be one of {1,2,3,4}? Just because something 
> can't be negative, does not mean it should be unsigned.

There are benefits to making that explicit by choosing an unsigned
type: the result of converting a narrower unsigned type to a wider
unsigned type is unchanged.  The result of converting it to a wider
signed type may change.  This has an impact on value range propagation
which in turn can influence other optimizations as well as warnings.

Choosing an unsigned type has liabilities as well.  Because unsigned
arithmetic wraps around zero whereas signed arithmetic does not (it
overflows which is undefined), its results are less constrained than
those of signed arithmetic.

Martin

> 
> Conversion between `uint64_t` and `double` is much slower than `int64_t` 
> on some platforms, e.g. x86, so 'smaller and faster code' is also 
> groundless [1].
> 
> 
> A signed integer is a value that denotes the number or amount of 
> something, such as bytes, characters, files, seconds, kilometers, or 
> even dollars. An unsigned integer is merely a fixed-sized collection of 
> bits. That's the only fundamental difference.
> 
> 
> [1] https://gcc.godbolt.org/z/6YY61rhGe
> 
> 


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: strlen
  2021-07-09 17:26                             ` strlen Martin Sebor
@ 2021-07-09 19:51                               ` johnsfine
  2021-07-09 20:19                               ` strlen Alejandro Colomar (man-pages)
  1 sibling, 0 replies; 15+ messages in thread
From: johnsfine @ 2021-07-09 19:51 UTC (permalink / raw)
  To: msebor, lh_mouse, gcc-help, jg; +Cc: linux-man, fw, mtk.manpages

 > Choosing an unsigned type has liabilities as well.  Because unsigned
> arithmetic wraps around zero whereas signed arithmetic does not (it
> overflows which is undefined), its results are less constrained than
> those of signed arithmetic.
In the vast majority of real programming cases in which the defined overflow of unsigned causes a significant performance weakness, signed has a different but typically worse performance weakness.
I wish gcc had some attribute to correct this flaw.  (If that has been added since I last checked, please tell me).
The common example (that an expert will recognize as an example of a more common issue) is:
p[x+1]  where x is a 32 bit unsigned.

p[ size_t(x) + 1 ] very often generates smaller/faster code.  But that is too tedious for the programmer to insert and the programmer should not need to know when that would help.

The programmer often knows that size_t(x)+1 would behave the same as size_t(x+1), but the compiler does not know that.The compiler knows which of size_t(x)+1 vs. size_t(x+1) is better in a specific usage.  The programmer may know that one or the other can be better, but shouldn't need to worry about which.

Usually those who want this feature express it as a data type like uint32_t but with undefined overflow.  So the compiler can then know that size_t(x)+1 would behave the same as size_t(x+1) because the code behavior would be undefined if that were not true.
I would prefer an "unspecified promotion" attribute to an "undefined overflow" attribute, so a type that is stored as 32 bit unsigned, but at the compiler's convenience MIGHT be promoted to a larger type anywhere up the path (within the tree) from where it occurs to where it would necessarily stop being a 32 bit unsigned anyway.
Example p[x*q+r] in the common case that p is of some type such that this is actually an inlined call to operator[](size_t)
assuming q and r are also smaller than size_t, and x is of the type with unspecified promotion, we have size_t(x*q+r) but the compiler should freely be able to choose size_t(x*q)+r or size_t(x)*q+r, whichever of those optimizes best.
In almost all cases that really matter, such an unspecified promotion vs. undefined overflow would allow exactly the same optimizations.  In the obscure cases where one would allow an optimization that the other would not, I think the programmer would be much less surprised by the "unspecified promotion" behavior than "undefined overflow".
But all of this is just a wish for slightly better.  In the world as it exists now, choosing a 32 bit unsigned as an index type produces smaller faster code than either a 64 bit unsigned or either size signed.  The cases where a 64 bit index type (either signed or unsigned) is smaller/faster than 32 bit would be fixed by such an attribute, but even without fixing that, they are dwarfed by the cases where a 32 bit unsigned index type is better than 64 bit.  Through all of that, the beginner choice of 32 bit signed for an indexed type produces the least efficient code.

All of this also only applies in the case where you are really trying to push performance and you are really sure there won't be more than 4 billion items in one level of one container.  Otherwise (and for beginners, regardless) your index type should be size_t because that is the library's index type.
I have done a lot programming in situations where the performance of the index type is critical, and even though the total number of objects exceeds 4 billion, the total number of objects visible to a single indexing operation cannot.  I always do something like:typedef unsigned index_t;and then use index_t for all indexes.That allows a test build changing just that typedef to use std::size_t to verify both that the results are the same and that those same results take significantly longer.  I think there are only a dozen places in a massive amount of code where profiling then told me to do just a bit better by changing p[x+1] to (p+1)[x] where p was an iterator and x was an index.  The optimizer would get to the same place with p[size_t(x)+1], which also works for vectors, but I find (p+1)[x] clearer and I tended to access through iterators even when the containers were vectors.

-----Original Message-----
From: Martin Sebor <msebor@gmail.com>
To: LIU Hao <lh_mouse@126.com>; johnsfine@verizon.net; gcc-help@gcc.gnu.org <gcc-help@gcc.gnu.org>; jg@jguk.org <jg@jguk.org>
Cc: linux-man@vger.kernel.org <linux-man@vger.kernel.org>; fw@deneb.enyo.de <fw@deneb.enyo.de>; mtk.manpages@gmail.com <mtk.manpages@gmail.com>
Sent: Fri, Jul 9, 2021 1:26 pm
Subject: Re: strlen

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: strlen
  2021-07-09 17:26                             ` strlen Martin Sebor
  2021-07-09 19:51                               ` strlen johnsfine
@ 2021-07-09 20:19                               ` Alejandro Colomar (man-pages)
  1 sibling, 0 replies; 15+ messages in thread
From: Alejandro Colomar (man-pages) @ 2021-07-09 20:19 UTC (permalink / raw)
  To: Martin Sebor, LIU Hao, johnsfine, gcc-help, jg
  Cc: linux-man, fw, mtk.manpages


On 7/9/21 7:26 PM, Martin Sebor wrote:
> On 7/8/21 8:08 PM, LIU Hao via Gcc-help wrote:
>> 在 7/8/21 8:05 PM, johnsfine--- via Gcc-help 写道:
>>>   This is not the forum for such a discussion.  But I want to make 
>>> people reading this aware that many expert C and C++ programmers 
>>> (likely a majority) consider that advice to avoid unsigned types to 
>>> be horrible advice.

I'm sorry if it's not the right place, I could remove the lists
from the CC if it's too noisy, but I think an important point is
here, and a couple of emails won't damage too much the mailing lists.

On the other hand, I consider bad etiquette removing CCs from
a discussion.

>>> I advise people to avoid signed types and I do so myself.  If an 
>>> integer value won't be negative, it shouldn't be signed.  That makes 
>>> your intent clearer to anyone reading your code, and (especially in 
>>> x86-64) lets the compiler generate smaller and faster code.

As others have showed with facts, and the Google style guide also hints,
using unsigned types just removes the opportunity of the compiler
to optimize on overflow, because it has to account for wrapping around.

>>
>> That makes no sense. Would you prefer unsigned integers to signed 
>> ones, for something that can only be one of {1,2,3,4}? Just because 
>> something can't be negative, does not mean it should be unsigned.
That.
The same way that because a number is never going to be
greater than 100 is it any better to use [u]int256_t.
Just use int, unless you have a reason not to.

Please John, read the paper from Bjarne about unsigned types,
it really covers a lot.  If you still disagree after reading that,
feel free to argument it.

<http://open-std.org/JTC1/SC22/WG21/docs/papers/2019/p1428r0.pdf>

> 
> There are benefits to making that explicit by choosing an unsigned
> type: the result of converting a narrower unsigned type to a wider
> unsigned type is unchanged.  The result of converting it to a wider
> signed type may change.  This has an impact on value range propagation
> which in turn can influence other optimizations as well as warnings.

That problem with implicit conversions to larger types is not a
problem of signed integers, not even a problem of unsigned integers.
It's a problem of mixing both.
If you use signed integers for everything, you won't have that problem.


Alex


-- 
Alejandro Colomar
Linux man-pages comaintainer; https://www.kernel.org/doc/man-pages/
http://www.alejandro-colomar.es/

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: strlen
  2021-07-08 11:06                       ` strlen Alejandro Colomar (man-pages)
  2021-07-08 12:05                         ` strlen johnsfine
@ 2021-07-08 12:13                         ` Xi Ruoyao
  2021-07-08 23:49                         ` strlen Segher Boessenkool
  2021-07-09 10:50                         ` strlen Jonny Grant
  3 siblings, 0 replies; 15+ messages in thread
From: Xi Ruoyao @ 2021-07-08 12:13 UTC (permalink / raw)
  To: Alejandro Colomar (man-pages), Jonny Grant
  Cc: gcc-help, linux-man, Florian Weimer, Michael Kerrisk

On Thu, 2021-07-08 at 13:06 +0200, Alejandro Colomar (man-pages) via
Gcc-help wrote:
> On 7/8/21 12:07 PM, Jonny Grant wrote:
> > Thank you for your reply.
> > 
> > We can't guarantee safestrlen() won't be called with NULL. So because
> > strlen() itself doesn't check for NULL in C standard we'd need to call
> > the wrapper so that NULL can be checked for.
> > 
> > I'd like to avoid the compiler removing certain execution paths.
> > I'd rather keep all code paths, even if they are not taken, just in
> > case a NULL pointer creeps in due to an external device that is
> > connected to an embedded system.

If you are taking a pointer from external device "correctly", gcc won't
delete your NULL checking path.  For example:

// defined by linker script
extern volatile char *an_io_port_providing_a_pointer;

int f()
{
    char *ptr = an_io_port_providing_a_pointer;

    // C standard disallows to remove it
    if (ptr == NULL) {
        gracefully_report_bug("some message");
        return -EINVAL;
    }

    return g(ptr);
}

Or

// in assembly
extern char *read_pointer_from_io_port(int io_port_id);

int f()
{
    char *ptr = read_pointer_from_io_port(IO_PORT_A);

    // C standard disallows to remove it
    if (ptr == NULL) {
        gracefully_report_bug("some message");
        return -EINVAL;
    }

    return g(ptr);
}

OTOH, if you are taking the pointer from external input incorrectly (i.
e. violating C standard and invoking some UB), even if you used some way
to enforce the compiler to keep the NULL checking, it would be still
unsafe.

Even if you want to be "careful" (I'd rather call this "paranoid"), you
can use -fno-delete-null-pointer-checks, instead of turning off all
optimizations.

And, GCC "optimize" attribute/pragma is somewhat buggy and only intended
for debugging GCC.  If you need to turn off some optmization for a
function, it's better to put the function into a seperate TU and use
command line option to disable the optimization.

By the way, if C can't provide the safety feature you need (for example
programming something launching a nuclear missile :), maybe it's better
to use Ada or something.

-- 
Xi Ruoyao <xry111@mengyan1223.wang>
School of Aerospace Science and Technology, Xidian University

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: strlen
  2021-07-08 11:06                       ` strlen Alejandro Colomar (man-pages)
  2021-07-08 12:05                         ` strlen johnsfine
  2021-07-08 12:13                         ` strlen Xi Ruoyao
@ 2021-07-08 23:49                         ` Segher Boessenkool
  2021-07-09 13:54                           ` strlen Jonny Grant
  2021-07-09 10:50                         ` strlen Jonny Grant
  3 siblings, 1 reply; 15+ messages in thread
From: Segher Boessenkool @ 2021-07-08 23:49 UTC (permalink / raw)
  To: Alejandro Colomar (man-pages)
  Cc: Jonny Grant, gcc-help, linux-man, Florian Weimer, Michael Kerrisk

On Thu, Jul 08, 2021 at 01:06:17PM +0200, Alejandro Colomar (man-pages) via Gcc-help wrote:
> On 7/8/21 12:07 PM, Jonny Grant wrote:
> >We can't guarantee safestrlen() won't be called with NULL. So because 
> >strlen() itself doesn't check for NULL in C standard we'd need to call the 
> >wrapper so that NULL can be checked for.

> >size_t __attribute__((optimize("O0"))) safestrlen(const char * s)
> >{
> >     if (NULL == s) return 0;
> >     else return strlen(s);
> >}

> That also allows differentiating a length of 0 (i.e., "") from an 
> invalid string (i.e., NULL), by returning -1 for NULL.

It is incorrect to return any particular value for strlen(0); not 0, not
-1, not anything.  Since there *is* no string, it doesn't have a length
either.

So instead of making some function for this, I recommend just writing
something like

  bla = s ? strlen(s) : 0;

wherever you need it.  If a function name isn't self-explanatory, and
even *cannot* be, your factoring is most likely not ideal.  Code is
primarily there for humans to read, it should be optimised for that.


Segher

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: strlen
  2021-07-08 23:49                         ` strlen Segher Boessenkool
@ 2021-07-09 13:54                           ` Jonny Grant
  2021-07-09 14:17                             ` strlen Alejandro Colomar (man-pages)
  0 siblings, 1 reply; 15+ messages in thread
From: Jonny Grant @ 2021-07-09 13:54 UTC (permalink / raw)
  To: Segher Boessenkool, Alejandro Colomar (man-pages)
  Cc: gcc-help, linux-man, Florian Weimer, Michael Kerrisk



On 09/07/2021 00:49, Segher Boessenkool wrote:
> On Thu, Jul 08, 2021 at 01:06:17PM +0200, Alejandro Colomar (man-pages) via Gcc-help wrote:
>> On 7/8/21 12:07 PM, Jonny Grant wrote:
>>> We can't guarantee safestrlen() won't be called with NULL. So because 
>>> strlen() itself doesn't check for NULL in C standard we'd need to call the 
>>> wrapper so that NULL can be checked for.
> 
>>> size_t __attribute__((optimize("O0"))) safestrlen(const char * s)
>>> {
>>>     if (NULL == s) return 0;
>>>     else return strlen(s);
>>> }
> 
>> That also allows differentiating a length of 0 (i.e., "") from an 
>> invalid string (i.e., NULL), by returning -1 for NULL.
> 
> It is incorrect to return any particular value for strlen(0); not 0, not
> -1, not anything.  Since there *is* no string, it doesn't have a length
> either.
> 
> So instead of making some function for this, I recommend just writing
> something like
> 
>   bla = s ? strlen(s) : 0;


Hi Segher

Yes, this could work. But it does rely on programmer typing it like that every time... Maybe an inline function better.

inline size_t safestrlen(const char * s) {return s?strlen(s) : 0}

Perhaps there are too many email addresses on this cc list now.

I'd prefer a Annex K of C11 style function ISO/IEC TR 24731-1 for strlen() - but there isn't one such as strnlen_s.


> 
> wherever you need it.  If a function name isn't self-explanatory, and
> even *cannot* be, your factoring is most likely not ideal.  Code is
> primarily there for humans to read, it should be optimised for that.
> 
> 
> Segher
> .

Good point
Jonny

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: strlen
  2021-07-09 13:54                           ` strlen Jonny Grant
@ 2021-07-09 14:17                             ` Alejandro Colomar (man-pages)
  2021-07-09 16:11                               ` strlen Xi Ruoyao
  2021-07-10  1:00                               ` strlen Segher Boessenkool
  0 siblings, 2 replies; 15+ messages in thread
From: Alejandro Colomar (man-pages) @ 2021-07-09 14:17 UTC (permalink / raw)
  To: Jonny Grant, Segher Boessenkool
  Cc: gcc-help, linux-man, Florian Weimer, Michael Kerrisk

Hi Jonny, Segher,

On 7/9/21 3:54 PM, Jonny Grant wrote:
> Yes, this could work. But it does rely on programmer typing it like that every time... Maybe an inline function better.

I agree on that.

> 
> inline size_t safestrlen(const char * s) {return s?strlen(s) : 0}
> 
> Perhaps there are too many email addresses on this cc list now.
> 
> I'd prefer a Annex K of C11 style function ISO/IEC TR 24731-1 for strlen() - but there isn't one such as strnlen_s.

Please, consider not calling some function safesomething() or similar, 
as it isn't 100% safe.  It's like calling some thing "the new X".  How 
will you call the next version?  "the nova X"? And the next? "the 
supernew X"?

As I said before, unsigned types are unsafe, you may want to accept it 
or not, but they are.

Now, the day you realize that and develop an even safer function that 
doesn't use unsigned size_t, what will you call it?  supersafestrlen()?

Use names that define your functions as closely as possible.

> 
> 
> On 09/07/2021 00:49, Segher Boessenkool wrote:
>> wherever you need it.  If a function name isn't self-explanatory, and

Agree on this

>> even *cannot* be, your factoring is most likely not ideal.  Code is
But not on this.

You could call it strlennull(), that is, a strlen() that special-cases 
NULL.  I find it a good enough name, as long as you document your function.

It saves the problem of repeating yourself every time.

>> primarily there for humans to read, it should be optimised for that.
>>

Agree on this again, but I think the following is readable:

len = strlennull(maybenull);

Regards,

Alex

-- 
Alejandro Colomar
Linux man-pages comaintainer; https://www.kernel.org/doc/man-pages/
http://www.alejandro-colomar.es/

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: strlen
  2021-07-09 14:17                             ` strlen Alejandro Colomar (man-pages)
@ 2021-07-09 16:11                               ` Xi Ruoyao
  2021-07-10  1:00                               ` strlen Segher Boessenkool
  1 sibling, 0 replies; 15+ messages in thread
From: Xi Ruoyao @ 2021-07-09 16:11 UTC (permalink / raw)
  To: Alejandro Colomar (man-pages), Jonny Grant, Segher Boessenkool
  Cc: gcc-help, linux-man, Florian Weimer, Michael Kerrisk

On Fri, 2021-07-09 at 16:17 +0200, Alejandro Colomar (man-pages) via
Gcc-help wrote:
> Hi Jonny, Segher,
> 
> On 7/9/21 3:54 PM, Jonny Grant wrote:
> > Yes, this could work. But it does rely on programmer typing it like
> > that every time... Maybe an inline function better.
> 
> I agree on that.
> 
> > 
> > inline size_t safestrlen(const char * s) {return s?strlen(s) : 0}
> > 
> > Perhaps there are too many email addresses on this cc list now.

I think the discussion at least has nothing to do with linux-man or gcc-
help: man pages just describe the existing API (C or POSIX or Linux
specific), and GCC just compiles code and doesn't care what the API is.
Neither is a place to discuss "how to design an API".

And I think Jonny should discuss the API design with the users of his
API (maybe his collegue or downstream developers), instead of some
random guys in mail list.  The users are the ones who will call his
function anyway so it's better to choose an API they like.  Yes, Jonny
can "force" the users to do something for safety, but this decision
should also be discussed with them and documented.  Or they won't
understand the decision, and may "invent" or "improvise" some "new
wheels", breaking Jonny's design.

For example, I don't like a function silently treats NULL as an empty
string.  I prefer a function to abort() or print a log "strlen_checked()
is called with NULL, there is a bug in your code" when I (mis)use NULL.
But it's just my 2 cents: if the potential users of the API agree the
function to act as that, then it's good to go.

-- 
Xi Ruoyao <xry111@mengyan1223.wang>
School of Aerospace Science and Technology, Xidian University

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: strlen
  2021-07-09 14:17                             ` strlen Alejandro Colomar (man-pages)
  2021-07-09 16:11                               ` strlen Xi Ruoyao
@ 2021-07-10  1:00                               ` Segher Boessenkool
  1 sibling, 0 replies; 15+ messages in thread
From: Segher Boessenkool @ 2021-07-10  1:00 UTC (permalink / raw)
  To: Alejandro Colomar (man-pages)
  Cc: Jonny Grant, gcc-help, linux-man, Florian Weimer, Michael Kerrisk

Fine, I'll bite :-)

On Fri, Jul 09, 2021 at 04:17:08PM +0200, Alejandro Colomar (man-pages) wrote:
> On 7/9/21 3:54 PM, Jonny Grant wrote:
> >Yes, this could work. But it does rely on programmer typing it like that 
> >every time... Maybe an inline function better.
> 
> I agree on that.

A function (or any other abstraction) can be fine for this, *iff* you
can make people use it correctly.  Since it is pretty much impossible to
give a good succinct name to this function, I posit that is not the
case.  Please feel free to prove me wrong (by coming up with a decent
name for it).

> >I'd prefer a Annex K of C11 style function ISO/IEC TR 24731-1 for strlen() 
> >- but there isn't one such as strnlen_s.
> 
> Please, consider not calling some function safesomething() or similar, 
> as it isn't 100% safe.  It's like calling some thing "the new X".  How 
> will you call the next version?  "the nova X"? And the next? "the 
> supernew X"?
> 
> As I said before, unsigned types are unsafe, you may want to accept it 
> or not, but they are.

I thought Annex K was great entertainment, but calling unsigned types
"unsafe" takes the cake.

Unsigned types are Z/nZ with n some power of two.  Signed types are not
even Z (overflow is undefined).  Unsigned types are useful for
describing many machine things.  They are useful for sizes, not only
because sizes cannot be negative, but also because sizes can be bigger
than the maximum positive number that can fit in the same size signed
number.  They are useful for bitty things, registers maybe, stuff in
memory, or device I/O registers.  And they are much more useful than C
signed numbers for holding memory addresses, where you need that (you
can do sane aritmetic on it).

Using unsigned types without range checking is often wrong ("unsafe" in
your words).  Using signed types without range checking is just as wrong
in the same cases, if not more (overflow is undefined).  At least in the
"unsigned" case it is *possible* its behaviour is what the programmer
intended!

> Agree on this again, but I think the following is readable:
> 
> len = strlennull(maybenull);

If you use it a million times, of course you can give it a short and
non-sensical name, and expect the users to learn what it means.  If not,
it is better to be slightly more verbose, and reduce the mental load.

Segher

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: strlen
  2021-07-08 11:06                       ` strlen Alejandro Colomar (man-pages)
                                           ` (2 preceding siblings ...)
  2021-07-08 23:49                         ` strlen Segher Boessenkool
@ 2021-07-09 10:50                         ` Jonny Grant
  2021-07-09 11:27                           ` strlen Alejandro Colomar (man-pages)
  3 siblings, 1 reply; 15+ messages in thread
From: Jonny Grant @ 2021-07-09 10:50 UTC (permalink / raw)
  To: Alejandro Colomar (man-pages)
  Cc: linux-man, Michael Kerrisk, Florian Weimer, gcc-help



On 08/07/2021 12:06, Alejandro Colomar (man-pages) wrote:
> On 7/8/21 12:07 PM, Jonny Grant wrote:
>> Thank you for your reply.
>>
>> We can't guarantee safestrlen() won't be called with NULL. So because strlen() itself doesn't check for NULL in C standard we'd need to call the wrapper so that NULL can be checked for.
>>
>> I'd like to avoid the compiler removing certain execution paths.
>> I'd rather keep all code paths, even if they are not taken, just in case a NULL pointer creeps in due to an external device that is connected to an embedded system.
>>
>>
>> Probably this would work:
>>
>> size_t __attribute__((optimize("O0"))) safestrlen(const char * s)
>> {
>>      if (NULL == s) return 0;
>>      else return strlen(s);
>> }
> 
> I don't think you don't need that.  Unless there's a bug in GCC, it shouldn't optimize that path unless it is 100% sure that it will never be called.

That is good, so the code will always be kept! As compiler will never find all calls to strlen() and be sure those calls are never NULL.

> Moreover, I recommend you to optimize as much as possible.
> Even though NULL is possible in your code, I guess it's unlikely.
> 
> Also, calling a function safe is too generic.
> I'd call it with the suffix null, as it act different on null.
> 
> Also, I recommend avoiding 'size_t' (and any other unsigned types, BTW).
> See <https://google.github.io/styleguide/cppguide.html#Integer_Types>.
> Use the POSIX type 'ssize_t'.
> That also allows differentiating a length of 0 (i.e., "") from an invalid string (i.e., NULL), by returning -1 for NULL.
> 

https://man7.org/linux/man-pages/man3/strlen.3.html
size_t strlen(const char *s);

I'd rather not change the return type from POSIX size_t in any wrapper of strlen. Unless it is part of C11 Annex K style standards improvement.

Cheers, Jonny

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: strlen
  2021-07-09 10:50                         ` strlen Jonny Grant
@ 2021-07-09 11:27                           ` Alejandro Colomar (man-pages)
  2021-07-09 11:43                             ` strlen Alejandro Colomar (man-pages)
  0 siblings, 1 reply; 15+ messages in thread
From: Alejandro Colomar (man-pages) @ 2021-07-09 11:27 UTC (permalink / raw)
  To: Jonny Grant
  Cc: linux-man, Michael Kerrisk, Florian Weimer, gcc-help,
	Segher Boessenkool, Xi Ruoyao

Hi Jonny,

On 7/9/21 12:50 PM, Jonny Grant wrote:
> 
> 
> On 08/07/2021 12:06, Alejandro Colomar (man-pages) wrote:
>> On 7/8/21 12:07 PM, Jonny Grant wrote:
>>> Thank you for your reply.
>>>
>>> We can't guarantee safestrlen() won't be called with NULL. So because strlen() itself doesn't check for NULL in C standard we'd need to call the wrapper so that NULL can be checked for.
>>>
>>> I'd like to avoid the compiler removing certain execution paths.
>>> I'd rather keep all code paths, even if they are not taken, just in case a NULL pointer creeps in due to an external device that is connected to an embedded system.
>>>
>>>
>>> Probably this would work:
>>>
>>> size_t __attribute__((optimize("O0"))) safestrlen(const char * s)
>>> {
>>>      if (NULL == s) return 0;
>>>      else return strlen(s);
>>> }
>>
>> I don't think you don't need that.  Unless there's a bug in GCC, it shouldn't optimize that path unless it is 100% sure that it will never be called.
> 
> That is good, so the code will always be kept! As compiler will never find all calls to strlen() and be sure those calls are never NULL.

Not always.  If you inline that function, that path may be removed in
some calls, if the compiler knows better than you that it can.  My point
is that you shouldn't care; your code is completely legal, and whatever
the compiler decides to do will also be legal (no undefined behavior,
and no crashes).  If it optimizes, it will be a good thing that you
shouldn't prevent.

If the compiler does otherwise, that's a bug in the compiler, and
something you can't solve by writing different code or preventing
optimizations, or at least there's no guarantee about it, since it's a
bug.  But I don't believe you'll find a bug in this case.  So please,
trust the compiler, at least when using perfectly defined behavior.  And
if you don't trust the compiler, which is perfectly reasonable, test it,
but don't try to workaround bugs that don't exist.

> 
>> Moreover, I recommend you to optimize as much as possible.
>> Even though NULL is possible in your code, I guess it's unlikely.
>>
>> Also, calling a function safe is too generic.
>> I'd call it with the suffix null, as it act different on null.
>>
>> Also, I recommend avoiding 'size_t' (and any other unsigned types, BTW).
>> See <https://google.github.io/styleguide/cppguide.html#Integer_Types>.
>> Use the POSIX type 'ssize_t'.
>> That also allows differentiating a length of 0 (i.e., "") from an invalid string (i.e., NULL), by returning -1 for NULL.
>>
> 
> https://man7.org/linux/man-pages/man3/strlen.3.html
> size_t strlen(const char *s);
> 
> I'd rather not change the return type from POSIX size_t in any wrapper of strlen. Unless it is part of C11 Annex K style standards improvement.

That's a historical accident.

A long time ago (much before I was born, and much before the first
standard, I mean in the early times of K&R C and Unix), unsigned types
were used more than they should, and the first C standards (I mean
ANSI/ISO standards (i.e., C89, C99, ...), not POSIX), with a lot of
already existing code, didn't attempt to change the language, but to
annotate common usage.

In POSIX.1 there's a mix, because POSIX has the type 'ssize_t', which is
not defined by the C standards.  POSIX in general tends to use the
signed type 'ssize_t' for its POSIX-only functions (i.e., not in the C
standards).

Annex K has been an attempt of Microsoft to provide safer functions, but
while there are some functions there that have good intentions, most of
them are just badly designed.  That annex K is DOA, and will probably be
marked as deprecated in C22 (currently C2x).

I think that a standard should not try to design new functions, and
instead just annotate common usage, as they did in the first ones.
Problems like the ones Annex K suffers could have been detected early if
they had been implemented as an extension to some compiler(s) decade(s)
before being standardized.  Therefore, if the implementation passes the
test of time, you standardize it, else not, IMO.  Otherwise, we have a
standard that is declared deprecated in the next version of the
standard, similar to what is happening with the C++ standards (which,
guess what, BTW I recently read that they are undeprecating a lot of C
stuff they deprecated in the first standards).

Using unsigned for anything else than bitfileds and similar stuff is
just *wrong*, as you can read from the Google C++ style guide I linked
before.
Another source you can read is this paper from Bjarne Stroustrup:
<http://open-std.org/JTC1/SC22/WG21/docs/papers/2019/p1428r0.pdf>
This is one of the few cases where I agree with something coming from
him.  I hope some day we get a ssizeof operator :-)

Cheers,

Alex

-- 
Alejandro Colomar
Linux man-pages comaintainer; https://www.kernel.org/doc/man-pages/
http://www.alejandro-colomar.es/

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: strlen
  2021-07-09 11:27                           ` strlen Alejandro Colomar (man-pages)
@ 2021-07-09 11:43                             ` Alejandro Colomar (man-pages)
  0 siblings, 0 replies; 15+ messages in thread
From: Alejandro Colomar (man-pages) @ 2021-07-09 11:43 UTC (permalink / raw)
  To: Jonny Grant
  Cc: linux-man, Michael Kerrisk, Florian Weimer, gcc-help,
	Segher Boessenkool, Xi Ruoyao



On 7/9/21 1:27 PM, Alejandro Colomar (man-pages) wrote:
> Annex K has been an attempt of Microsoft to provide safer functions, but
> while there are some functions there that have good intentions, most of
> them are just badly designed.  That annex K is DOA, and will probably be
> marked as deprecated in C22 (currently C2x).
> 
> I think that a standard should not try to design new functions, and
> instead just annotate common usage, as they did in the first ones.
> Problems like the ones Annex K suffers could have been detected early if
> they had been implemented as an extension to some compiler(s) decade(s)
> before being standardized.  Therefore, if the implementation passes the
> test of time, you standardize it, else not, IMO.  Otherwise, we have a
> standard that is declared deprecated in the next version of the
> standard, similar to what is happening with the C++ standards (which,
> guess what, BTW I recently read that they are undeprecating a lot of C
> stuff they deprecated in the first standards).

But let's analyze Annex K (C11):

It has been designed by Microsoft.

MS's compiler (MS Visual Studio) doesn't even fully support C99 yet (and
by that trend, I doubt it never will).  At most it supports C89.  Visual
Studio has a long history of not supporting C except for those parts
required to implement their C++ compiler.  Would you buy a car designed
by a bike manufacturer?


-- 
Alejandro Colomar
Linux man-pages comaintainer; https://www.kernel.org/doc/man-pages/
http://www.alejandro-colomar.es/

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2021-07-10  1:02 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <0bf239e9-cfc7-8889-ca00-b1d1566bd174@jguk.org>
     [not found] ` <87lfhpgxf8.fsf@mid.deneb.enyo.de>
     [not found]   ` <017a5a66-ba66-7cc8-c607-f851c2e54fc4@jguk.org>
     [not found]     ` <87363whf2z.fsf@mid.deneb.enyo.de>
     [not found]       ` <48e874cb-2b95-2783-ddfc-ae12d9aaf8f5@jguk.org>
     [not found]         ` <87bl7fozxi.fsf@mid.deneb.enyo.de>
     [not found]           ` <23679a28-5986-0af2-4d98-7de68ed0deec@jguk.org>
     [not found]             ` <53b3666b-d200-ef97-b050-cc38669c61cd@gmail.com>
     [not found]               ` <b6fccca1-6e2b-fb20-d9d6-9df94cd3f05f@gmail.com>
     [not found]                 ` <564825ed-1e1f-b344-da35-1b83c551ed5f@jguk.org>
     [not found]                   ` <b71170df-7c6b-4582-c3d1-84b811fe5259@gmail.com>
     [not found]                     ` <5566b180-1333-d73b-22ee-6c6d32053921@jguk.org>
2021-07-08 11:06                       ` strlen Alejandro Colomar (man-pages)
2021-07-08 12:05                         ` strlen johnsfine
2021-07-09  2:08                           ` strlen LIU Hao
2021-07-09 17:26                             ` strlen Martin Sebor
2021-07-09 19:51                               ` strlen johnsfine
2021-07-09 20:19                               ` strlen Alejandro Colomar (man-pages)
2021-07-08 12:13                         ` strlen Xi Ruoyao
2021-07-08 23:49                         ` strlen Segher Boessenkool
2021-07-09 13:54                           ` strlen Jonny Grant
2021-07-09 14:17                             ` strlen Alejandro Colomar (man-pages)
2021-07-09 16:11                               ` strlen Xi Ruoyao
2021-07-10  1:00                               ` strlen Segher Boessenkool
2021-07-09 10:50                         ` strlen Jonny Grant
2021-07-09 11:27                           ` strlen Alejandro Colomar (man-pages)
2021-07-09 11:43                             ` strlen Alejandro Colomar (man-pages)

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).