public inbox for gcc@gcc.gnu.org
 help / color / mirror / Atom feed
* Re: size_t vs long.
       [not found]       ` <27229b18-673b-d038-9a4c-c32c50ca547c@cs.ucla.edu>
@ 2022-11-17 23:04         ` Alejandro Colomar
  2022-11-23 20:08           ` Using size_t to crash on off-by-one errors (was: size_t vs long.) Alejandro Colomar
  0 siblings, 1 reply; 2+ messages in thread
From: Alejandro Colomar @ 2022-11-17 23:04 UTC (permalink / raw)
  To: Paul Eggert, A, libc-alpha; +Cc: gcc


[-- Attachment #1.1: Type: text/plain, Size: 9017 bytes --]

Hi Paul,

On 11/17/22 22:39, Paul Eggert wrote:
>>> Second and more important, that code is bogus. Nobody should ever write code 
>>> like that. If I wrote code like that, I'd *want* a trap.
>>
>> for (size_t i = 41; i < sizeof A / sizeof A[0]; --i) {
>>    A[i] = something_nice;
>> }
>>
>> The code above seems like a bug only because one isn't used to it.  Once you 
>> get used to it, it can become natural, but let's go for the more natural:
>>
>>
>> for (size_t i = 0; i < sizeof A / sizeof A[0]; ++i) {
>>    A[i] = something_nice;
>> } 
> 
> Those loops do not mean the same thing.

Sorry, I didn't mean that they are the same.  For code that means exactly the 
same as

for (size_t i = 41; i < nitems(A); --i) {
     A[i] = something_nice;
}


we need:


for (size_t i = 0; i < nitems(A) && i <= 41; ++i) {
     A[i] = something_nice;
}

or

for (idx_t i = 0; i < nitems(A) && i <= 41; ++i) {
     A[i] = something_nice;
}

or

for (idx_t i = 41; i < nitems(A) && i >= 0; --i) {
     A[i] = something_nice;
}

(always assuming SIZE_MAX > INT_MAX.)
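(By the way, nitems() is not standard C; here's a minimal sketch of the macro
as I define it, in case it helps -- my own definition, not from the standard:)

```c
#include <stddef.h>

/* Minimal sketch of the nitems() macro used in these examples (my own
   definition; not standard C): the number of elements in an array.
   It must only be applied to true arrays, not pointers. */
#define nitems(arr)  (sizeof(arr) / sizeof((arr)[0]))
```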


Without more context, I can't know if any of them are bogus.  Of course, that 41 
should normally come from a variable instead of a magic number.

We can see that Jens's code is at least simpler, which is a bonus point.

But normally, we don't enter a loop at a random value, but rather at one of its 
extremes, which is what I showed in my alternative example.  Let's also look at 
the alternative forms in which we can write it:


for (size_t i = 0; i < nitems(A); ++i) {
     A[i] = something_nice;
}


is equivalent to:


for (size_t i = nitems(A) - 1; i < nitems(A); --i) {
     A[i] = something_nice;
}

or

for (idx_t i = nitems(A) - 1; i >= 0; --i) {
     A[i] = something_nice;
}

or

for (idx_t i = 0; i < nitems(A); ++i) {
     A[i] = something_nice;
}


There's not much difference in this case regarding readability.

If 'i' is modified within the loop (apart from the obvious --i), then we need to 
check both bounds of the array, which with size_t comes for free; with idx_t you 
need to add code:



for (idx_t i = nitems(A) - 1; i < nitems(A) && i >= 0; --i) {
     A[i] = something_nice;
     i += foo;
}

or

for (idx_t i = 0; i < nitems(A) && i >= 0; ++i) {
     A[i] = something_nice;
     i += foo;
}

vs

for (size_t i = 0; i < nitems(A); ++i) {
     A[i] = something_nice;
     i += foo;
}

or

for (size_t i = nitems(A) - 1; i < nitems(A); --i) {
     A[i] = something_nice;
     i += foo;
}


Again, size_t seems to win in simplicity.


> The first is bogus; the second one is OK 
> (notice, the bogus loop has a "41", the OK loop doesn't).
> 
> I'm not surprised you didn't notice how bogus the first loop was - most people 
> wouldn't notice it either. And it's Gustedt's main point! I don't know why he 
> went off the rails with that overly-clever code, but he did.

I still don't know what the intended bug was, or why it would differ from the 
idx_t versions.  Please elaborate.

> 
> 
>> The main advantage of this code compared to the equivalent ssize_t or 
>> ptrdiff_t or idx_t code is that if you somehow write an off-by-one error, and 
>> manage to access the array at [-1], if i is unsigned you'll access [SIZE_MAX], 
>> which will definitely crash your program.
> 
> That's not true on the vast majority of today's platforms, which don't have 
> subscript checking, and for which a[-1] is treated the same way a[SIZE_MAX] is. 
> On my platform (Fedora 36 x86-64) the same machine code is generated for 'a' and 
> 'b' for the following C code.
> 
>    #include <stdint.h>
>    int a(int *p) { return p[-1]; }
>    int b(int *p) { return p[SIZE_MAX]; }

Hmm, this seems to be true on my platform (amd64), per the experiment I just did:

$ cat s.c
#include <sys/types.h>

char
f(char *p, ssize_t i)
{
	return p[i];
}
$ cat u.c
#include <stddef.h>

char
f(char *p, size_t i)
{
	return p[i];
}
$ cc -Wall -Wextra -Werror -S -O3 s.c u.c
$ diff -u u.s s.s
--- u.s	2022-11-17 23:41:47.773805041 +0100
+++ s.s	2022-11-17 23:41:47.761805265 +0100
@@ -1,15 +1,15 @@
-	.file	"u.c"
+	.file	"s.c"
  	.text
  	.p2align 4
  	.globl	f
  	.type	f, @function
  f:
-.LFB0:
+.LFB6:
  	.cfi_startproc
  	movzbl	(%rdi,%rsi), %eax
  	ret
  	.cfi_endproc
-.LFE0:
+.LFE6:
  	.size	f, .-f
  	.ident	"GCC: (Debian 12.2.0-9) 12.2.0"
  	.section	.note.GNU-stack,"",@progbits


It seems like a violation of the standard, doesn't it?

The [] operator doesn't itself impose a type on its index; the index should be 
used with whatever type it has after the usual promotions.  If I pass a size_t 
to it, the index is unsigned, and that should be preserved, by accessing the 
array at a huge offset, which the compiler has no way of knowing exists or not, 
given only that function definition.  The extreme case of -1 vs SIZE_MAX might 
not be the best one, since a pointer would need to be 0 for [SIZE_MAX] to be 
accessible, but if you replace those by -RANDOM and (size_t)-RANDOM, then the 
compiler definitely needs to generate different code, yet it doesn't.

I'm guessing this is an optimization: GCC knows that we will never come close 
to using the whole 64-bit address space.  If we use int and unsigned, things 
change:

$ cat s.c
char
f(char *p, int i)
{
	return p[i];
}
$ cat u.c
char
f(char *p, unsigned i)
{
	return p[i];
}
$ cc -Wall -Wextra -Werror -S -O3 s.c u.c
$ diff -u u.s s.s
--- u.s	2022-11-17 23:44:54.446318186 +0100
+++ s.s	2022-11-17 23:44:54.434318409 +0100
@@ -1,4 +1,4 @@
-	.file	"u.c"
+	.file	"s.c"
  	.text
  	.p2align 4
  	.globl	f
@@ -6,7 +6,7 @@
  f:
  .LFB0:
  	.cfi_startproc
-	movl	%esi, %esi
+	movslq	%esi, %rsi
  	movzbl	(%rdi,%rsi), %eax
  	ret
  	.cfi_endproc


I'm guessing that GCC doesn't make that assumption here, and I'd guess the 
unsigned version would crash, while the signed version would invoke nasal 
demons.  Anyway, now that I'm here, I'll test it:


$ cat s.c
[[gnu::noipa]]
char
f(char *p, int i)
{
	return p[i];
}

int main(void)
{
	int i = -1;
	char c[4];

	return f(c, i);
}
$ cc -Wall -Wextra -Werror -O3 s.c
$ ./a.out
$ echo $?
0


$ cat u.c
[[gnu::noipa]]
char
f(char *p, unsigned i)
{
	return p[i];
}

int main(void)
{
	unsigned i = -1;
	char c[4];

	return f(c, i);
}
$ cc -Wall -Wextra -Werror -O3 u.c
$ ./a.out
Segmentation fault


I get this SEGV difference consistently.  I CCed gcc@ in case they consider this 
to be something they want to address.  Maybe the optimization is important for 
size_t-sized indices, but if it is not, I'd prefer getting the SEGV for SIZE_MAX.

> 
> Yes, debugging implementations might catch p[SIZE_MAX], but the ones that do 
> will likely catch p[-1] as well.
> 
> In short, there's little advantage to using size_t for indexes, and there are 
> real disadvantages due to comparison confusion and lack of signed integer 
> overflow checking.
> 
> 
>>> First, Gustedt is technically incorrect, because the code *can* trap on 
>>> platforms where SIZE_MAX <= INT_MAX,
> 
>> I honestly don't know of any existing platforms where that is true
> 
> They're a dying breed. The main problem from my point of view is that C and 
> POSIX allow these oddballs, so if you want to write really portable code you 
> have to worry about them - and this understandably discourages people from 
> writing really portable code. (What's the point of coding to the standards if 
> it's just a bunch of make-work?)

I understand your point, since you work on highly-portable code.

But there's always a tradeoff, and I'd very much like the standards not to 
allow such oddballs.  They're cleaning up with C23, so things seem to be going 
in the right direction.  I remember the discussion we had about ILP64.  I hope 
those oddballs get reduced considerably in the future.  Luckily, 
SIZE_MAX <= INT_MAX seems to be quite a bit less common than ILP64.

> 
> Anyway, one example is Unisys Clearpath C, in which INT_MAX and SIZE_MAX both 
> equal 2**39 - 1.

Lol.  It would have been fun to see it be 2**42 - 1.

> This is allowed by the current POSIX and C standards, and this 
> compiler is still for sale and supported. (I doubt whether they'll port it to 
> C23, so there's that....)

Heh, I can't remember now the name of that compiler that deliberately 
implemented UB in the most unexpected ways, just for fun, which was itself an 
interesting experiment for checking the portability of a given piece of code.  
This reminds me of it.

> 
> 
>> C23 will require that signed integers are 2's complement, which I guess 
>> removes the possibility of a trap
> 
> It doesn't remove the possibility, since signed integers can have trap 
> representations. But we are straying from the more important point.
> 

Cheers,

Alex

-- 
<http://www.alejandro-colomar.es/>

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 2+ messages in thread

* Using size_t to crash on off-by-one errors (was: size_t vs long.)
  2022-11-17 23:04         ` size_t vs long Alejandro Colomar
@ 2022-11-23 20:08           ` Alejandro Colomar
  0 siblings, 0 replies; 2+ messages in thread
From: Alejandro Colomar @ 2022-11-23 20:08 UTC (permalink / raw)
  To: Paul Eggert, libc-alpha; +Cc: gcc, A


[-- Attachment #1.1: Type: text/plain, Size: 4925 bytes --]

Hi,

On 11/18/22 00:04, Alejandro Colomar wrote:
>>> The main advantage of this code compared to the equivalent ssize_t or 
>>> ptrdiff_t or idx_t code is that if you somehow write an off-by-one error, and 
>>> manage to access the array at [-1], if i is unsigned you'll access 
>>> [SIZE_MAX], which will definitely crash your program.
>>
>> That's not true on the vast majority of today's platforms, which don't have 
>> subscript checking, and for which a[-1] is treated the same way a[SIZE_MAX] 
>> is. On my platform (Fedora 36 x86-64) the same machine code is generated for 
>> 'a' and 'b' for the following C code.
>>
>>    #include <stdint.h>
>>    int a(int *p) { return p[-1]; }
>>    int b(int *p) { return p[SIZE_MAX]; }
> 
> Hmm, this seems to be true on my platform (amd64), per the experiment I just did:
> 
> $ cat s.c
> #include <sys/types.h>
> 
> char
> f(char *p, ssize_t i)
> {
>      return p[i];
> }
> $ cat u.c
> #include <stddef.h>
> 
> char
> f(char *p, size_t i)
> {
>      return p[i];
> }
> $ cc -Wall -Wextra -Werror -S -O3 s.c u.c
> $ diff -u u.s s.s
> --- u.s    2022-11-17 23:41:47.773805041 +0100
> +++ s.s    2022-11-17 23:41:47.761805265 +0100
> @@ -1,15 +1,15 @@
> -    .file    "u.c"
> +    .file    "s.c"
>       .text
>       .p2align 4
>       .globl    f
>       .type    f, @function
>   f:
> -.LFB0:
> +.LFB6:
>       .cfi_startproc
>       movzbl    (%rdi,%rsi), %eax
>       ret
>       .cfi_endproc
> -.LFE0:
> +.LFE6:
>       .size    f, .-f
>       .ident    "GCC: (Debian 12.2.0-9) 12.2.0"
>       .section    .note.GNU-stack,"",@progbits
> 
> 
> It seems like a violation of the standard, doesn't it?
> 
> The [] operator doesn't itself impose a type on its index; the index should be 
> used with whatever type it has after the usual promotions.  If I pass a size_t 
> to it, the index is unsigned, and that should be preserved, by accessing the 
> array at a huge offset, which the compiler has no way of knowing exists or not, 
> given only that function definition.  The extreme case of -1 vs SIZE_MAX might 
> not be the best one, since a pointer would need to be 0 for [SIZE_MAX] to be 
> accessible, but if you replace those by -RANDOM and (size_t)-RANDOM, then the 
> compiler definitely needs to generate different code, yet it doesn't.
> 
> I'm guessing this is an optimization: GCC knows that we will never come close 
> to using the whole 64-bit address space.  If we use int and unsigned, things 
> change:
> 
> $ cat s.c
> char
> f(char *p, int i)
> {
>      return p[i];
> }
> $ cat u.c
> char
> f(char *p, unsigned i)
> {
>      return p[i];
> }
> $ cc -Wall -Wextra -Werror -S -O3 s.c u.c
> $ diff -u u.s s.s
> --- u.s    2022-11-17 23:44:54.446318186 +0100
> +++ s.s    2022-11-17 23:44:54.434318409 +0100
> @@ -1,4 +1,4 @@
> -    .file    "u.c"
> +    .file    "s.c"
>       .text
>       .p2align 4
>       .globl    f
> @@ -6,7 +6,7 @@
>   f:
>   .LFB0:
>       .cfi_startproc
> -    movl    %esi, %esi
> +    movslq    %esi, %rsi
>       movzbl    (%rdi,%rsi), %eax
>       ret
>       .cfi_endproc
> 
> 
> I'm guessing that GCC doesn't make that assumption here, and I'd guess the 
> unsigned version would crash, while the signed version would invoke nasal 
> demons.  Anyway, now that I'm here, I'll test it:
> 
> 
> $ cat s.c
> [[gnu::noipa]]
> char
> f(char *p, int i)
> {
>      return p[i];
> }
> 
> int main(void)
> {
>      int i = -1;
>      char c[4];
> 
>      return f(c, i);
> }
> $ cc -Wall -Wextra -Werror -O3 s.c
> $ ./a.out
> $ echo $?
> 0
> 
> 
> $ cat u.c
> [[gnu::noipa]]
> char
> f(char *p, unsigned i)
> {
>      return p[i];
> }
> 
> int main(void)
> {
>      unsigned i = -1;
>      char c[4];
> 
>      return f(c, i);
> }
> $ cc -Wall -Wextra -Werror -O3 u.c
> $ ./a.out
> Segmentation fault
> 
> 
> I get this SEGV difference consistently.  I CCed gcc@ in case they consider this 
> to be something they want to address.  Maybe the optimization is important for 
> size_t-sized indices, but if it is not, I'd prefer getting the SEGV for SIZE_MAX.
> 

After some thought, of course the compiler can't produce any different code, 
since pointers are 64 bits.  It would be a different story if pointers were 128 
bits, but that might cause its own issues: should sizes then still be 64 bits, 
or 128 bits?  Maybe a configurable size_t would be interesting for debugging.

Anyway, it's good to know that tweaking size_t to be 32 bits in some debug 
builds might help catch some off-by-one errors.

Cheers,

Alex

-- 
<http://www.alejandro-colomar.es/>

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 2+ messages in thread

end of thread, other threads:[~2022-11-23 20:08 UTC | newest]

Thread overview: 2+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <CAOM0=dac1uM+96yT7f_GDp1f0bF=oW2JoDPFPPOgd4OjzRR+mA@mail.gmail.com>
     [not found] ` <c139d396-4084-f8c0-44e8-31d20d171056@gmail.com>
     [not found]   ` <dd16db9e-bdfe-901d-9b9f-c0aa2836e55e@cs.ucla.edu>
     [not found]     ` <380b196e-b78e-3b0e-7399-ee106b0e716c@gmail.com>
     [not found]       ` <27229b18-673b-d038-9a4c-c32c50ca547c@cs.ucla.edu>
2022-11-17 23:04         ` size_t vs long Alejandro Colomar
2022-11-23 20:08           ` Using size_t to crash on off-by-one errors (was: size_t vs long.) Alejandro Colomar

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).