public inbox for gcc-help@gcc.gnu.org
 help / color / mirror / Atom feed
* Assignment of union containing const-qualifier member
@ 2024-01-30 21:45 Alejandro Colomar
  2024-01-31 18:14 ` Alejandro Colomar
  0 siblings, 1 reply; 12+ messages in thread
From: Alejandro Colomar @ 2024-01-30 21:45 UTC (permalink / raw)
  To: gcc-help

[-- Attachment #1: Type: text/plain, Size: 1564 bytes --]

Hi,

I'm trying to do something like the following:

	$ cat const.c 
	union u {
		int        a;
		const int  b;
	};

	int
	main(void)
	{
		union u  u, v;

		u.a = 42;
		v = u;
	}
	$ cc -Wall -Wextra const.c 
	const.c: In function ‘main’:
	const.c:12:11: error: assignment of read-only variable ‘v’
	   12 |         v = u;
	      |           ^
	const.c:9:21: warning: variable ‘v’ set but not used [-Wunused-but-set-variable]
	    9 |         union u  u, v;
	      |                     ^

The actual data I'm using is not just an int, but that serves to
reproduce the problem easily.  In reality, the union is more like this:

	struct rstr {
	    const size_t       length;
	    const char *const  start;
	};

	union str {
	    struct {
		size_t         length;
		char           *start;
	    } w;
	    struct rstr        r;
	};

I don't see anywhere in C11 that makes this a constraint violation, and
considering that the const member is fully overlapped by a non-const
member, I don't see why this assignment would be bad.  I couldn't find
what's the exact semantics of assignment to a union, but I guess it's
similar to initialization, and since the first member is non-const, this
should be doable.  Maybe it's a bit of unspecified behavior, precisely
because nothing seems to specify it.

Is that really a constraint violation, or should the compiler accept the
code?

Have a lovely day,
Alex

-- 
<https://www.alejandro-colomar.es/>
Looking for a remote C programming job at the moment.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Assignment of union containing const-qualifier member
  2024-01-30 21:45 Assignment of union containing const-qualifier member Alejandro Colomar
@ 2024-01-31 18:14 ` Alejandro Colomar
  2024-02-04  7:33   ` Amol Surati
  0 siblings, 1 reply; 12+ messages in thread
From: Alejandro Colomar @ 2024-01-31 18:14 UTC (permalink / raw)
  To: gcc-help

[-- Attachment #1: Type: text/plain, Size: 2311 bytes --]

On Tue, Jan 30, 2024 at 10:45:11PM +0100, Alejandro Colomar wrote:
> Hi,
> 
> I'm trying to do something like the following:
> 
> 	$ cat const.c 
> 	union u {
> 		int        a;
> 		const int  b;
> 	};
> 
> 	int
> 	main(void)
> 	{
> 		union u  u, v;
> 
> 		u.a = 42;
> 		v = u;
> 	}
> 	$ cc -Wall -Wextra const.c 
> 	const.c: In function ‘main’:
> 	const.c:12:11: error: assignment of read-only variable ‘v’
> 	   12 |         v = u;
> 	      |           ^
> 	const.c:9:21: warning: variable ‘v’ set but not used [-Wunused-but-set-variable]
> 	    9 |         union u  u, v;
> 	      |                     ^
> 
> The actual data I'm using is not just an int, but that serves to
> reproduce the problem easily.  In reality, the union is more like this:
> 
> 	struct rstr {
> 	    const size_t       length;
> 	    const char *const  start;
> 	};
> 
> 	union str {
> 	    struct {
> 		size_t         length;
> 		char           *start;
> 	    } w;
> 	    struct rstr        r;
> 	};
> 
> I don't see anywhere in C11 that makes this a constraint violation, and

Ahh, I didn't find it.  It's specified undex 6.3 (Conversions),
6.3.2.1 (Lvalues, arrays, and function designators)
<http://port70.net/~nsz/c/c11/n1570.html#6.3.2.1>:

< A modifiable lvalue is an lvalue that does not have array type, does
< not have an incomplete type, does not have a const- qualified type,
< and if it is a structure or union, does not have any member
< (including, recursively, any member or element of all contained
< aggregates or unions) with a const- qualified type.

I'm wondering if this is a bit conservative, and a union like this could
be a modifiable lvalue.  I find it useful, since it allows inheriting a
non-modifiable version of the string which cannot be made modifiable
again, but would let you edit the string as long as you keep the union.

I added the named member 'w' to allow copying v.w = u.w, but when the
union is part of a larger structure and I need to copy the entire
structure, that doesn't help.  memcpy(3) does help, but it looses all
type safety.

Maybe this could be allowed as an extension.  Any thoughts?

Cheers,
Alex

-- 
<https://www.alejandro-colomar.es/>
Looking for a remote C programming job at the moment.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Assignment of union containing const-qualifier member
  2024-01-31 18:14 ` Alejandro Colomar
@ 2024-02-04  7:33   ` Amol Surati
  2024-02-04  8:18     ` Amol Surati
  2024-02-04 18:40     ` Alejandro Colomar
  0 siblings, 2 replies; 12+ messages in thread
From: Amol Surati @ 2024-02-04  7:33 UTC (permalink / raw)
  To: Alejandro Colomar; +Cc: gcc-help

On Wed, 31 Jan 2024 at 23:46, Alejandro Colomar via Gcc-help
<gcc-help@gcc.gnu.org> wrote:
>
> On Tue, Jan 30, 2024 at 10:45:11PM +0100, Alejandro Colomar wrote:
> > Hi,
> >

[ ... ]

> structure, that doesn't help.  memcpy(3) does help, but it looses all
> type safety.
>
> Maybe this could be allowed as an extension.  Any thoughts?
>

Does it make sense to propose that, if the first top-level member of a
union is completely (i.e. recursively) writable, then a non-const union
object as a whole is writable? If so, then, for union objects a and b of
a union that has such const members, a = b can be expected to not
raise errors about const-correctness.

It seems that a union only provides a view of the object. The union
object doesn't automatically become const qualified if a member
of the union is const-qualified. This seems to be the reason v.w = u.w
works; otherwise, that modification can also be viewed as the
modification of an object (v.r) defined with a const-qualified type through
the use of an lvalue (v.w) with non-const-qualified type - something that's
forbidden by the std.

More towards the use of the string as described:
If there are multiple such union objects that point to the same string,
and if a piece of code decides to modify the string, other consumers of
this string remain unaware of the modification, unless they check for it,
for e.g., by keeping a copy, calc. hash, etc., to ensure that the string was
indeed not silently modified behind their backs.

I think it is better to have a 'class' and associated APIs.
See [1], for e.g., or the implementation of c++ std::string.

The ownership of an object of such a class can be passed by passing
a non-const pointer to the object.

Functions that are not supposed to own the object can be passed a
const pointer. Despite that, if such functions need to modify it for local
needs, they can create a copy to work with.

One can additionally maintain a ref-count on the char pointer, to avoid
having to unnecessarily copy a string if it is going to be placed in several
stay-resident-after-return data-structures.

-Amol

[1] https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3210.pdf

> Cheers,
> Alex
>
> --
> <https://www.alejandro-colomar.es/>
> Looking for a remote C programming job at the moment.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Assignment of union containing const-qualifier member
  2024-02-04  7:33   ` Amol Surati
@ 2024-02-04  8:18     ` Amol Surati
  2024-02-04 18:40     ` Alejandro Colomar
  1 sibling, 0 replies; 12+ messages in thread
From: Amol Surati @ 2024-02-04  8:18 UTC (permalink / raw)
  To: Alejandro Colomar; +Cc: gcc-help

On Sun, 4 Feb 2024 at 13:03, Amol Surati <suratiamol@gmail.com> wrote:
>
> On Wed, 31 Jan 2024 at 23:46, Alejandro Colomar via Gcc-help
> <gcc-help@gcc.gnu.org> wrote:
> >
> > On Tue, Jan 30, 2024 at 10:45:11PM +0100, Alejandro Colomar wrote:
> > > Hi,
> > >
>
> [ ... ]
>
> > structure, that doesn't help.  memcpy(3) does help, but it looses all
> > type safety.
> >
> > Maybe this could be allowed as an extension.  Any thoughts?
> >
>
> Does it make sense to propose that, if the first top-level member of a
> union is completely (i.e. recursively) writable, then a non-const union
> object as a whole is writable? If so, then, for union objects a and b of
> a union that has such const members, a = b can be expected to not
> raise errors about const-correctness.

This doesn't look okay- what if the last written-to member was not the
first member of the union? The assignment can either copy the full
size of the union (i.e. sizeof (b)) or copy just the member that was
last written to within b. The latter requires the compiler to maintain
an 'active' member of unions, which I believe isn't expected from the
compilers.

I believe the compiler performs an assignment a=b through
memcpy (or through statements that have similar effects as that of
memcpy) that, as you noticed, ignores type-safety. It may be because
of such obstacles that the presence of a const member in the union
prevents a union object from being a modifiable lvalue.

>
> It seems that a union only provides a view of the object. The union
> object doesn't automatically become const qualified if a member
> of the union is const-qualified. This seems to be the reason v.w = u.w
> works; otherwise, that modification can also be viewed as the
> modification of an object (v.r) defined with a const-qualified type through
> the use of an lvalue (v.w) with non-const-qualified type - something that's
> forbidden by the std.
>
> More towards the use of the string as described:
> If there are multiple such union objects that point to the same string,
> and if a piece of code decides to modify the string, other consumers of
> this string remain unaware of the modification, unless they check for it,
> for e.g., by keeping a copy, calc. hash, etc., to ensure that the string was
> indeed not silently modified behind their backs.
>
> I think it is better to have a 'class' and associated APIs.
> See [1], for e.g., or the implementation of c++ std::string.
>
> The ownership of an object of such a class can be passed by passing
> a non-const pointer to the object.
>
> Functions that are not supposed to own the object can be passed a
> const pointer. Despite that, if such functions need to modify it for local
> needs, they can create a copy to work with.
>
> One can additionally maintain a ref-count on the char pointer, to avoid
> having to unnecessarily copy a string if it is going to be placed in several
> stay-resident-after-return data-structures.
>
> -Amol
>
> [1] https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3210.pdf
>
> > Cheers,
> > Alex
> >
> > --
> > <https://www.alejandro-colomar.es/>
> > Looking for a remote C programming job at the moment.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Assignment of union containing const-qualifier member
  2024-02-04  7:33   ` Amol Surati
  2024-02-04  8:18     ` Amol Surati
@ 2024-02-04 18:40     ` Alejandro Colomar
  2024-02-04 20:37       ` Alejandro Colomar
  2024-02-07  4:13       ` Amol Surati
  1 sibling, 2 replies; 12+ messages in thread
From: Alejandro Colomar @ 2024-02-04 18:40 UTC (permalink / raw)
  To: Amol Surati; +Cc: gcc-help

[-- Attachment #1: Type: text/plain, Size: 8760 bytes --]

Hi Amol,

On Sun, Feb 04, 2024 at 01:03:48PM +0530, Amol Surati wrote:
> On Wed, 31 Jan 2024 at 23:46, Alejandro Colomar via Gcc-help
> <gcc-help@gcc.gnu.org> wrote:
> >
> > On Tue, Jan 30, 2024 at 10:45:11PM +0100, Alejandro Colomar wrote:
> > > Hi,
> > >
> 
> [ ... ]
> 
> > structure, that doesn't help.  memcpy(3) does help, but it looses all
> > type safety.
> >
> > Maybe this could be allowed as an extension.  Any thoughts?
> >
> 
> Does it make sense to propose that, if the first top-level member of a
> union is completely (i.e. recursively) writable, then a non-const union
> object as a whole is writable? If so, then, for union objects a and b of
> a union that has such const members, a = b can be expected to not
> raise errors about const-correctness.

To have a specific proposal, I'll specify it as a diff of ISO C11:

	$ diff -u c11 suggestion 
	--- c11	2024-02-04 19:37:27.520851005 +0100
	+++ suggestion	2024-02-04 19:38:56.785402567 +0100
	@@ -8,8 +8,8 @@
	 does not have array type,
	 does not have an incomplete type,
	 does not have a const- qualified type,
	-and if it is a structure or union,
	-does not have any member
	+and if it is a structure does not have any member,
	+or if it is a union does not have all members,
	 (including, recursively,
	 any member or element of all contained aggregates or unions)
	 with a const- qualified type.

(Modifying <http://port70.net/~nsz/c/c11/n1570.html#6.3.2.1p1>)

> 
> It seems that a union only provides a view of the object. The union
> object doesn't automatically become const qualified if a member
> of the union is const-qualified. This seems to be the reason v.w = u.w
> works; otherwise, that modification can also be viewed as the
> modification of an object (v.r) defined with a const-qualified type through
> the use of an lvalue (v.w) with non-const-qualified type - something that's
> forbidden by the std.

Modifying a union via a non-const member is fine in C, I believe.  I
think you're creating a new object, and discarding the old one, so you
don't need to care if there was an old object defined via a
const-qualified type.  That is, the following code is valid C, AFAIK:

	alx@debian:~/tmp$ cat u.c 
	union u {
		int        a;
		const int  b;
	};

	int
	main(void)
	{
		union u  u = {.b = 42};

		u.a = 7;
		return u.b;
	}
	alx@debian:~/tmp$ gcc-14 -Wall -Wextra u.c 
	alx@debian:~/tmp$ ./a.out ; echo $?
	7
	alx@debian:~/tmp$ clang-17 -Weverything u.c 
	alx@debian:~/tmp$ ./a.out ; echo $?
	7

> More towards the use of the string as described:
> If there are multiple such union objects that point to the same string,
> and if a piece of code decides to modify the string, other consumers of
> this string remain unaware of the modification, unless they check for it,
> for e.g., by keeping a copy, calc. hash, etc., to ensure that the string was
> indeed not silently modified behind their backs.

`const` only guarantees that an object is not modified through that
pointer.  As long as you keep another pointer to the same object, it can
be modified via that other pointer.  To guarantee that an object is
really constant --at least for what concerns a function--, you need to
also specify `restrict`.  If you have a `const type* restrict`, then you
know for sure it is constant, as far as the current function is
concerned.

If you're worried about multi-threaded programs, well, unions aren't any
more problematic here than passing a `const T *restrict` to a function,
and modifying it in another thread via a non-const lvalue.  As long as
the original object wasn't const, that's fair game.  It's the
programmer's task to make sure the functions behave well if that can
happen.

> 
> I think it is better to have a 'class' and associated APIs.

But we can't have that in C.

> See [1], for e.g., or the implementation of c++ std::string.
> 
> The ownership of an object of such a class can be passed by passing
> a non-const pointer to the object.
> 
> Functions that are not supposed to own the object can be passed a
> const pointer. Despite that, if such functions need to modify it for local
> needs, they can create a copy to work with.
> 
> One can additionally maintain a ref-count on the char pointer, to avoid
> having to unnecessarily copy a string if it is going to be placed in several
> stay-resident-after-return data-structures.

I normally prefer simple C strings, with a simple pointer.  The reason
I'm using this struct+union is performance.  In nginx, to reduce memory
consumption (you can get substrings by copying a pointer and specifying
a length), and also avoid calculating lengths of strings more than once,
we use these structures.

So far, we were using a simple struct:

	typedef struct string {
		size_t  length;
		u_char  *start;
	} string_t;  // it has a different name, but let's keep it simple

But that means we basically can't use `const` at all with our strings.
Because if you specify

	void foo(const string_t *str);

that means that you can't modify the pointer, but you can actually
modify the pointee.  Which means that you can't guarantee that a string
isn't corrupted after some call, unless you inspect all the code that
the function calls, recursively.

I started working on a way to improve these strings around a year ago,
and have recently come up with something.

> 
> -Amol
> 
> [1] https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3210.pdf

Maybe you can get something from what I've learnt with strings in Nginx,
since they're quite close to what that proposal has.

The main concern I have with that proposal is the same concern I've had
with strings in Nginx so far: you can't really make them `const`.
Unless you make the type opaque, and only provide accessors via
functions that protect the strings even if they could modify them.

You can only make them const, if you use two distinct types: a read-only
version, let's call it rstring, and a read-write version, let's call it
string.

	struct rstring_s {
	    size_t                        length;
	    const char                    *start;
	};


	union nxt_string_u {
	    struct {
		size_t                    length;
		char                      *start;
	    };
	    struct {
		size_t                    length;
		char                      *start;
	    } w;
	    const rstring_t               r;
	};

In Nginx we have another complexity: we don't necessarily terminate our
strings: this allows getting a substring in the middle of another string
without needing to make an actual copy of the memory.  But then it means
we need more types to have type safety.  I haven't finished developing
that, so I can't tell you if the code below does work, but this is what
I'm really working with at the moment:

	struct nxt_rstr_s {
	    size_t                          length;
	    const u_char                    *start;
	};


	union nxt_str_u {
	    struct {
		size_t                      length;
		u_char                      *start;
	    };
	    struct {
		size_t                      length;
		u_char                      *start;
	    } w;
	    const nxt_rstr_t                r;
	};


	union nxt_rstrz_u {
	    struct {
		size_t                      length;
		union {
		    const u_char            *start;
		    const char              *cstrz;
		};
	    };
	    struct {
		size_t                      length;
		const u_char                *start;
	    } w;
	    const nxt_rstr_t                r;
	};


	union nxt_strz_u {
	    struct {
		size_t                      length;
		union {
		    u_char                  *start;
		    char                    *cstrz;
		};
	    };
	    struct {
		size_t                      length;
		u_char                      *start;
	    } w;
	    const nxt_rstr_t                r;
	    const nxt_rstrz_t               rz;
	};

Structures `***z` contain null-terminated strings, while the other ones
don't.  You can read terminated strings as non-terminated ones, but not
the other way.  And you can access writable strings as read-only
strings, but not the other way around.

(We use `u_char` to avoid the problems that `char` has due to its
ambiguous sign; I would personally prefer using -funsigned-char, but
that's what it is, for historic reasons.)
Anyway, that `u_char` makes sure we don't mix our strings with libc
calls accidentally, and I only provide the `cstrz` member in unions that
actually provide a libc-compatible string view.


Have a lovely day,
Alex

-- 
<https://www.alejandro-colomar.es/>
Looking for a remote C programming job at the moment.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Assignment of union containing const-qualifier member
  2024-02-04 18:40     ` Alejandro Colomar
@ 2024-02-04 20:37       ` Alejandro Colomar
  2024-02-07  4:13       ` Amol Surati
  1 sibling, 0 replies; 12+ messages in thread
From: Alejandro Colomar @ 2024-02-04 20:37 UTC (permalink / raw)
  To: Amol Surati; +Cc: gcc-help

[-- Attachment #1: Type: text/plain, Size: 2524 bytes --]

On Sun, Feb 04, 2024 at 07:40:23PM +0100, Alejandro Colomar wrote:
> > It seems that a union only provides a view of the object. The union
> > object doesn't automatically become const qualified if a member
> > of the union is const-qualified. This seems to be the reason v.w = u.w
> > works; otherwise, that modification can also be viewed as the
> > modification of an object (v.r) defined with a const-qualified type through
> > the use of an lvalue (v.w) with non-const-qualified type - something that's
> > forbidden by the std.
> 
> Modifying a union via a non-const member is fine in C, I believe.  I
> think you're creating a new object, and discarding the old one, so you
> don't need to care if there was an old object defined via a
> const-qualified type.  That is, the following code is valid C, AFAIK:
> 
> 	alx@debian:~/tmp$ cat u.c 
> 	union u {
> 		int        a;
> 		const int  b;
> 	};
> 
> 	int
> 	main(void)
> 	{
> 		union u  u = {.b = 42};
> 
> 		u.a = 7;
> 		return u.b;
> 	}
> 	alx@debian:~/tmp$ gcc-14 -Wall -Wextra u.c 
> 	alx@debian:~/tmp$ ./a.out ; echo $?
> 	7
> 	alx@debian:~/tmp$ clang-17 -Weverything u.c 
> 	alx@debian:~/tmp$ ./a.out ; echo $?
> 	7

Although I would have doubts about what happens if the union contains
an aggregate, and you modify part of it:

	union u {
		int        a[2];
		const int  b[2];
	};

	int
	main(void)
	{
		union u  u = {.b = {42, 13}};

		u.a[0] = 7;
		return u.b[0];
	}

or

	union u {
		struct {
			int        a0;
			int        a1;
		} a;
		const struct {
			const int  b0;
			const int  b1;
		} b;
	};

	int
	main(void)
	{
		union u  u = {.b = {42, 13}};

		u.a.a0 = 7;
		return u.b.b0;
	}

In these cases, we're mnodifying a const object (.b), via a non-const
lvalue (.a.a0, and .a.a1).  We're modifying it, because we keep the
original value at .b.b1, but the new one at .b.b0, so in this case we
can't say it's an entire new object.

Is this legal?  Both GCC and Clang accept it, but I have doubts.
Maybe you could warn if the program sets a union via a const member when
there are non-const members, because it seems like something wrong to
do.

But as long as you don't set the const member, you should be fine
according to ISO C.  That is, if you initialize the union via .a, then
anything is valid, as the object is defined non-const.

Have a lovely night,
Alex

-- 
<https://www.alejandro-colomar.es/>
Looking for a remote C programming job at the moment.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Assignment of union containing const-qualifier member
  2024-02-04 18:40     ` Alejandro Colomar
  2024-02-04 20:37       ` Alejandro Colomar
@ 2024-02-07  4:13       ` Amol Surati
  2024-02-07 13:29         ` Alejandro Colomar
  1 sibling, 1 reply; 12+ messages in thread
From: Amol Surati @ 2024-02-07  4:13 UTC (permalink / raw)
  To: Alejandro Colomar; +Cc: gcc-help

Hi Alex,


On Mon, 5 Feb 2024 at 00:10, Alejandro Colomar <alx@kernel.org> wrote:
>
> Hi Amol,
>
> On Sun, Feb 04, 2024 at 01:03:48PM +0530, Amol Surati wrote:
> > On Wed, 31 Jan 2024 at 23:46, Alejandro Colomar via Gcc-help
> > <gcc-help@gcc.gnu.org> wrote:
> > >
> > > On Tue, Jan 30, 2024 at 10:45:11PM +0100, Alejandro Colomar wrote:
> > > > Hi,
> > > >
> >
> > [ ... ]
> >
> > > structure, that doesn't help.  memcpy(3) does help, but it looses all
> > > type safety.
> > >
> > > Maybe this could be allowed as an extension.  Any thoughts?
> > >
> >
> > Does it make sense to propose that, if the first top-level member of a
> > union is completely (i.e. recursively) writable, then a non-const union
> > object as a whole is writable? If so, then, for union objects a and b of
> > a union that has such const members, a = b can be expected to not
> > raise errors about const-correctness.
>
> To have a specific proposal, I'll specify it as a diff of ISO C11:
>
>         $ diff -u c11 suggestion
>         --- c11 2024-02-04 19:37:27.520851005 +0100
>         +++ suggestion  2024-02-04 19:38:56.785402567 +0100
>         @@ -8,8 +8,8 @@
>          does not have array type,
>          does not have an incomplete type,
>          does not have a const- qualified type,
>         -and if it is a structure or union,
>         -does not have any member
>         +and if it is a structure does not have any member,
>         +or if it is a union does not have all members,

As mentioned in my other reply, the assignment from one object to
another (of the same union type that has a const member) will have
to copy the size of the largest member, regardless of which member
was active the last time the source object was written to. Such a
copy, I believe, is a sort of memcpy that the compiler inserts. And
as you had already noticed before, memcpy loses all type-safety.
These might be the reasons the std flags an error when such union
objects are assigned to - assignment implies type-erasing memcpy.

>          (including, recursively,
>          any member or element of all contained aggregates or unions)
>          with a const- qualified type.
>
> (Modifying <http://port70.net/~nsz/c/c11/n1570.html#6.3.2.1p1>)

[ ... ]

> Modifying a union via a non-const member is fine in C, I believe.  I

That's true, but as you noted in your other reply, unions allow diluting
the type-safety to a great extent.

Linux kernel has a piece of code that does modify such a union
although it possesses decorations for the sparse semantic checker.
In its essence, it is similar to:

union {
    const int flags;
    int __attribute__ ((noderef)) __flags;
};
See https://sparse.docs.kernel.org/en/latest/annotations.html#noderef

[ ... ]

> > I think it is better to have a 'class' and associated APIs.
>
> But we can't have that in C.

By 'class', I meant a rigid way of accessing the struct strictly through the
provided APIs.

[ ... ]

> The main concern I have with that proposal is the same concern I've had
> with strings in Nginx so far: you can't really make them `const`.
> Unless you make the type opaque, and only provide accessors via
> functions that protect the strings even if they could modify them.

The type need not be opaque, to allow for optimizations. But the users of
the type must behave as if the type were opaque. The bulk of the API can
be 'static inline' functions within the header, to allow for optimizations and
inlining within a single translation unit. There's also LTO and thin-LTOs to
carry out optimizations across translation units, if needed.

>
> You can only make them const, if you use two distinct types: a read-only
> version, let's call it rstring, and a read-write version, let's call it
> string.

But do we need both views (r/w and r/o) at the same time?

>
>         struct rstring_s {
>             size_t                        length;
>             const char                    *start;
>         };
>
>
>         union nxt_string_u {
>             struct {
>                 size_t                    length;
>                 char                      *start;
>             };
>             struct {
>                 size_t                    length;
>                 char                      *start;
>             } w;
>             const rstring_t               r;
>         };
>
> In Nginx we have another complexity: we don't necessarily terminate our
> strings: this allows getting a substring in the middle of another string
> without needing to make an actual copy of the memory.  But then it means
> we need more types to have type safety.  I haven't finished developing
> that, so I can't tell you if the code below does work, but this is what
> I'm really working with at the moment:
>
>         struct nxt_rstr_s {
>             size_t                          length;
>             const u_char                    *start;
>         };
>
>
>         union nxt_str_u {
>             struct {
>                 size_t                      length;
>                 u_char                      *start;
>             };
>             struct {
>                 size_t                      length;
>                 u_char                      *start;
>             } w;
>             const nxt_rstr_t                r;
>         };
>
>
>         union nxt_rstrz_u {
>             struct {
>                 size_t                      length;
>                 union {
>                     const u_char            *start;
>                     const char              *cstrz;
>                 };
>             };
>             struct {
>                 size_t                      length;
>                 const u_char                *start;
>             } w;
>             const nxt_rstr_t                r;
>         };
>
>
>         union nxt_strz_u {
>             struct {
>                 size_t                      length;
>                 union {
>                     u_char                  *start;
>                     char                    *cstrz;
>                 };
>             };
>             struct {
>                 size_t                      length;
>                 u_char                      *start;
>             } w;
>             const nxt_rstr_t                r;
>             const nxt_rstrz_t               rz;
>         };
>
> Structures `***z` contain null-terminated strings, while the other ones
> don't.  You can read terminated strings as non-terminated ones, but not
> the other way.  And you can access writable strings as read-only
> strings, but not the other way around.
>
> (We use `u_char` to avoid the problems that `char` has due to its
> ambiguous sign; I would personally prefer using -funsigned-char, but
> that's what it is, for historic reasons.)
> Anyway, that `u_char` makes sure we don't mix our strings with libc
> calls accidentally, and I only provide the `cstrz` member in unions that
> actually provide a libc-compatible string view.

Passing these structs+unions is still passing the string as r/w. It does
not seem too far away from passing non-const pointers and sizes.

In a sort of a project of mine, I had to implement a buffer that can be used
to store ref-counted (non-terminated) strings and binary data. I did initially
create two different types, buf_rw and buf_ro, but then scrapped that for
what I have now, a skeleton of which is shown below:

struct string {
    size_t size;
    const char *data;
};

data[i] is now marked as non-modifiable.

Passing 'const struct string *' to a function implies that the function cannot
modify (in addition to data[i]), the fields size and data, through the
passed ptr.

A function that is passed 'struct string *' can still be allowed to
inplace-modify data[i] by invoking appropriate string APIs that temporarily
casts away the constness of data[i].

The SEI CERT coding standard for C allows for an exception for such
modifications. See
https://wiki.sei.cmu.edu/confluence/display/c/EXP05-C.+Do+not+cast+away+a+const+qualification#:~:text=end%20is%20undefined%20*/-,EXP05%2DC%2DEX3,-%3A%20Because%20const

Some other options:
Make use of the sparse semantic checker.

It might help to name the fields with underscores to highlight the fact that
they are internal fields.  For e.g. 'str_size__' and 'str_data__'. Any use of
these fields outside of the dedicated API functions can now be easily
traced by simple grep searches.

It might also help to create a visible type such as

struct string {
    const uintptr_t raw[2];
};

and let the API manage the translation between uintptr_t and size_t/char *.

-Amol

[1] https://wiki.sei.cmu.edu/confluence/display/c/EXP05-C.+Do+not+cast+away+a+const+qualification#:~:text=end%20is%20undefined%20*/-,EXP05%2DC%2DEX3,-%3A%20Because%20const




>
> Have a lovely day,
> Alex
>
> --
> <https://www.alejandro-colomar.es/>
> Looking for a remote C programming job at the moment.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Assignment of union containing const-qualifier member
  2024-02-07  4:13       ` Amol Surati
@ 2024-02-07 13:29         ` Alejandro Colomar
  2024-02-12 10:45           ` Amol Surati
  0 siblings, 1 reply; 12+ messages in thread
From: Alejandro Colomar @ 2024-02-07 13:29 UTC (permalink / raw)
  To: Amol Surati; +Cc: gcc-help

[-- Attachment #1: Type: text/plain, Size: 12172 bytes --]

Hi Amol,

On Wed, Feb 07, 2024 at 09:43:55AM +0530, Amol Surati wrote:
> > To have a specific proposal, I'll specify it as a diff of ISO C11:
> >
> >         $ diff -u c11 suggestion
> >         --- c11 2024-02-04 19:37:27.520851005 +0100
> >         +++ suggestion  2024-02-04 19:38:56.785402567 +0100
> >         @@ -8,8 +8,8 @@
> >          does not have array type,
> >          does not have an incomplete type,
> >          does not have a const- qualified type,
> >         -and if it is a structure or union,
> >         -does not have any member
> >         +and if it is a structure does not have any member,
> >         +or if it is a union does not have all members,
> 
> As mentioned in my other reply, the assignment from one object to
> another (of the same union type that has a const member) will have
> to copy the size of the largest member, regardless of which member
> was active the last time the source object was written to. Such a
> copy, I believe, is a sort of memcpy that the compiler inserts. And
> as you had already noticed before, memcpy loses all type-safety.

The fact that the compiler does the equivalent of memcpy(3) doesn't
imply a removal of type safety.

The copiler is aware of the union type at the assignment, and so can
emit diagnostics if anything is wrong.

When assigning to a union member, only that member needs to be copied,
and I'm not sure what happens to other members that have a larger size.
As far as I can read in the standard, it's Undefined Behavior.  Maybe
someone can correct me.

When assigning to a union (not its members), the standard doesn't seem
to specify at all the semantics, so I can't really tell.  Depending on
the semantics of assigning to a member, I can tell.  At first glance, I
don't see anything we should worry.

> These might be the reasons the std flags an error when such union
> objects are assigned to - assignment implies type-erasing memcpy.
> 
> >          (including, recursively,
> >          any member or element of all contained aggregates or unions)
> >          with a const- qualified type.
> >
> > (Modifying <http://port70.net/~nsz/c/c11/n1570.html#6.3.2.1p1>)
> 
> [ ... ]
> 
> > Modifying a union via a non-const member is fine in C, I believe.  I
> 
> That's true, but as you noted in your other reply, unions allow diluting
> the type-safety to a great extent.

Not really.  If well used, they do actually improve type safety.  Think
of my string unions vs the struct proposed by Martin.  My union has
const correctness, which is impossible to achieve via structs.  A
similar problem happens with struct sockaddr, which for historic reasons
is a series of unrelated structures.  If they had been designed as a
hierarchic union, it wouldn't be so problematic, and you wouldn't need
to cast everywhere to use them.

> 
> Linux kernel has a piece of code that does modify such a union
> although it possesses decorations for the sparse semantic checker.
> In its essence, it is similar to:
> 
> union {
>     const int flags;
>     int __attribute__ ((noderef)) __flags;
> };
> See https://sparse.docs.kernel.org/en/latest/annotations.html#noderef
> 
> [ ... ]
> 
> > > I think it is better to have a 'class' and associated APIs.
> >
> > But we can't have that in C.
> 
> By 'class', I meant a rigid way of accessing the struct strictly through the
> provided APIs.

That's precisely what a union offers.  You can access it via a union
member, which then restricts you to use the functions that accept that
type, or you can keep the entire union for unlimited use.

> 
> [ ... ]
> 
> > The main concern I have with that proposal is the same concern I've had
> > with strings in Nginx so far: you can't really make them `const`.
> > Unless you make the type opaque, and only provide accessors via
> > functions that protect the strings even if they could modify them.
> 
> The type need not be opaque, to allow for optimizations. But the users of
> the type must behave as if the type were opaque. The bulk of the API can
> be 'static inline' functions within the header, to allow for optimizations and
> inlining within a single translation unit. There's also LTO and thin-LTOs to
> carry out optimizations across translation units, if needed.

Indeed.  But I'm not sure I'd like a string type that I must treat as a
FILE*.  Also, how do you initialize such strings?  With an EMPTY_STRING
macro?  Do you always get a heap pointer, so you can't declare a string
in the stack as this?

	struct string  s;

What's the content of that string?  Is a NULL pointer a valid state of
the string?  In NGINX we found this week several cases where we were
doing pointer arithmetics on strings that were holding NULL, precisely
due to this (or it was after some calloc(3), more likely, I don't
remember).  It was NULL + 0, which is the least of undefined behaviors
I'm concerned about, but hey, there's another NULL to worry about.

So, you'll need to do

	struct string  s = EMPTY_STRING;

to avoid ever having a NULL, which is yet another thing to worry about.

Having suffered these strings, I wouldn't recommend them as something
better than a char*.  In NGINX, we have them due to historic reasons,
and rewriting the entire code base would be unreasonable, and while they
may have been a good optimization 2 decades ago, I bet that it's a
useless micro-optimization today, and we could just use char* everywhere
to have less problems.

> > You can only make them const, if you use two distinct types: a read-only
> > version, let's call it rstring, and a read-write version, let's call it
> > string.
> 
> But do we need both views (r/w and r/o) at the same time?

Yes: I want to be able to take a r/w string and pass it to a function
that accepts a r/o string.  Since distinct types are incompatible, and
this is the main problem with things like sockaddr, you can't just cast;
that's likely to invoke Undefined Behavior.

It's the same as I can pass a char* to a function that accepts a
const char*, without doing a cast.

> 
> >
> >         struct rstring_s {
> >             size_t                        length;
> >             const char                    *start;
> >         };
> >
> >
> >         union nxt_string_u {
> >             struct {
> >                 size_t                    length;
> >                 char                      *start;
> >             };
> >             struct {
> >                 size_t                    length;
> >                 char                      *start;
> >             } w;
> >             const rstring_t               r;
> >         };
> >
> > In Nginx we have another complexity: we don't necessarily terminate our
> > strings: this allows getting a substring in the middle of another string
> > without needing to make an actual copy of the memory.  But then it means
> > we need more types to have type safety.  I haven't finished developing
> > that, so I can't tell you if the code below does work, but this is what
> > I'm really working with at the moment:
> >
> >         struct nxt_rstr_s {
> >             size_t                          length;
> >             const u_char                    *start;
> >         };
> >
> >
> >         union nxt_str_u {
> >             struct {
> >                 size_t                      length;
> >                 u_char                      *start;
> >             };
> >             struct {
> >                 size_t                      length;
> >                 u_char                      *start;
> >             } w;
> >             const nxt_rstr_t                r;
> >         };
> >
> >
> >         union nxt_rstrz_u {
> >             struct {
> >                 size_t                      length;
> >                 union {
> >                     const u_char            *start;
> >                     const char              *cstrz;
> >                 };
> >             };
> >             struct {
> >                 size_t                      length;
> >                 const u_char                *start;
> >             } w;
> >             const nxt_rstr_t                r;
> >         };
> >
> >
> >         union nxt_strz_u {
> >             struct {
> >                 size_t                      length;
> >                 union {
> >                     u_char                  *start;
> >                     char                    *cstrz;
> >                 };
> >             };
> >             struct {
> >                 size_t                      length;
> >                 u_char                      *start;
> >             } w;
> >             const nxt_rstr_t                r;
> >             const nxt_rstrz_t               rz;
> >         };
> >
> > Structures `***z` contain null-terminated strings, while the other ones
> > don't.  You can read terminated strings as non-terminated ones, but not
> > the other way.  And you can access writable strings as read-only
> > strings, but not the other way around.
> >
> > (We use `u_char` to avoid the problems that `char` has due to its
> > ambiguous sign; I would personally prefer using -funsigned-char, but
> > that's what it is, for historic reasons.)
> > Anyway, that `u_char` makes sure we don't mix our strings with libc
> > calls accidentally, and I only provide the `cstrz` member in unions that
> > actually provide a libc-compatible string view.
> 
> Passing these structs+unions is still passing the string as r/w. It does
> not seem too far away from passing non-const pointers and sizes.

The thing is, I don't always pass the union.

If I want to write the string in a function, I pass the union.  If I
want it r/o, I pass the const member.

> In a sort of a project of mine, I had to implement a buffer that can be used
> to store ref-counted (non-terminated) strings and binary data. I did initially
> create two different types, buf_rw and buf_ro, but then scrapped that for
> what I have now, a skeleton of which is shown below:
> 
> struct string {
>     size_t size;
>     const char *data;
> };
> 
> data[i] is now marked as non-modifiable.
> 
> Passing 'const struct string *' to a function implies that the function cannot
> modify (in addition to data[i]), the fields size and data, through the
> passed ptr.
> 
> A function that is passed 'struct string *' can still be allowed to
> inplace-modify data[i] by invoking appropriate string APIs that temporarily
> casts away the constness of data[i].

That would be a violation of const correctness.  If necessary, yeah, go
ahead.  But personally, I'd prefer avoiding that.  For something that is
proposed as an inclusion to ISO C, I'd rather not have it.

I would prefer that a libstring adds those APIs, rather than having them
in the standard.  They aren't uncontroversial.  Neither char* is
uncontroversial, but at least it's simple.

> The SEI CERT coding standard for C allows for an exception for such
> modifications. See
> https://wiki.sei.cmu.edu/confluence/display/c/EXP05-C.+Do+not+cast+away+a+const+qualification#:~:text=end%20is%20undefined%20*/-,EXP05%2DC%2DEX3,-%3A%20Because%20const
> 
> Some other options:
> Make use of the sparse semantic checker.
> 
> It might help to name the fields with underscores to highlight the fact that
> they are internal fields.  For e.g. 'str_size__' and 'str_data__'. Any use of
> these fields outside of the dedicated API functions can now be easily
> traced by simple grep searches.
> 
> It might also help to create a visible type such as
> 
> struct string {
>     const uintptr_t raw[2];
> };
> 
> and let the API manage the translation between uintptr_t and size_t/char *.

But you'll need a transparent union to do the magic.  All of that shows
that it isn't an easy thing through structures.  unions at least have it
easier.

Cheers,
Alex

-- 
<https://www.alejandro-colomar.es/>
Looking for a remote C programming job at the moment.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Assignment of union containing const-qualifier member
  2024-02-07 13:29         ` Alejandro Colomar
@ 2024-02-12 10:45           ` Amol Surati
  2024-02-12 11:18             ` Amol Surati
  2024-03-18  9:19             ` Alejandro Colomar
  0 siblings, 2 replies; 12+ messages in thread
From: Amol Surati @ 2024-02-12 10:45 UTC (permalink / raw)
  To: Alejandro Colomar; +Cc: gcc-help

Hi Alex,

On Wed, 7 Feb 2024 at 18:59, Alejandro Colomar <alx@kernel.org> wrote:
>
> Hi Amol,
>
> On Wed, Feb 07, 2024 at 09:43:55AM +0530, Amol Surati wrote:
[ ... ]
> The fact that the compiler does the equivalent of memcpy(3) doesn't
> imply a removal of type safety.
>
> The copiler is aware of the union type at the assignment, and so can
> emit diagnostics if anything is wrong.

True. That's the diagnostics that was emitted when running 'v=u'.

>
> When assigning to a union member, only that member needs to be copied,
> and I'm not sure what happens to other members that have a larger size.
> As far as I can read in the standard, it's Undefined Behavior.  Maybe
> someone can correct me.

The std says:
"When a value is stored in a member of an object of union type, the bytes
of the object representation that do not correspond to that member but
do correspond to other members take unspecified values."

> When assigning to a union (not its members), the standard doesn't seem
> to specify at all the semantics, so I can't really tell.  Depending on
> the semantics of assigning to a member, I can tell.  At first glance, I
> don't see anything we should worry.

[ ... ]
> Not really.  If well used, they do actually improve type safety.  Think
> of my string unions vs the struct proposed by Martin.  My union has
> const correctness, which is impossible to achieve via structs.  A

I am not sure I fully understand how using structs forbids achieving
const-correctness (vs the unions as declared), but I am also not aware
of all the ways in which nginx uses strings.

[ ... ]

> Indeed.  But I'm not sure I'd like a string type that I must treat as a
> FILE*.  Also, how do you initialize such strings?  With an EMPTY_STRING
> macro?  Do you always get a heap pointer, so you can't declare a string
> in the stack as this?
>
>         struct string  s;
>
> What's the content of that string?  Is a NULL pointer a valid state of
> the string?  In NGINX we found this week several cases where we were
> doing pointer arithmetics on strings that were holding NULL, precisely
> due to this (or it was after some calloc(3), more likely, I don't
> remember).  It was NULL + 0, which is the least of undefined behaviors
> I'm concerned about, but hey, there's another NULL to worry about.
>
> So, you'll need to do
>
>         struct string  s = EMPTY_STRING;
>
> to avoid ever having a NULL, which is yet another thing to worry about.
>
> Having suffered these strings, I wouldn't recommend them as something
> better than a char*.  In NGINX, we have them due to historic reasons,
> and rewriting the entire code base would be unreasonable, and while they
> may have been a good optimization 2 decades ago, I bet that it's a
> useless micro-optimization today, and we could just use char* everywhere
> to have less problems.

If a variable is going to be overwritten without being read first, then there's
no need to initialize it. Within my implementation, the state resulting from
zero-init, i.e. 'struct string s = {0};', is deliberately chosen to be
well-defined
for the implementation. For e.g. appending a char to a zero-init'd string is
well-defined.

> > > You can only make them const, if you use two distinct types: a read-only
> > > version, let's call it rstring, and a read-write version, let's call it
> > > string.
> >
> > But do we need both views (r/w and r/o) at the same time?
>
> Yes: I want to be able to take a r/w string and pass it to a function
> that accepts a r/o string.  Since distinct types are incompatible, and
> this is the main problem with things like sockaddr, you can't just cast;
> that's likely to invoke Undefined Behavior.

The std says:
"A pointer to an object type may be converted to a pointer to a different
object type. If the resulting pointer is not correctly aligned for the
referenced type, the behavior is undefined. Otherwise, when converted
back again, the result shall compare equal to the original pointer."

So typecasting between object types is well-defined, given that the
resulting pointer is indeed correctly aligned for the target type. The
various sock_addr types all impose a common default alignment,
because they all begin with a member of the same type.

The std. also says:
"An object shall have its stored value accessed only by an lvalue
expression that has one of the following types:
    . . .
    — a character type."

For e.g., the _sys_bind system-call under Linux kernel receives the
same 'const struct sockaddr *', along with the size of the data, from
glibc. The kernel then copies the data using memcpy (or equivalent
routines).

Does the pointer conversion forbid the kernel from memcpy'ing the
object using that converted pointer? (It is likely that the converted
pointer, if it is not of compatible type, may not be dereferenced using
that type; but here the kernel isn't dereferencing the pointer using
the 'const struct sockaddr *' type; it is reading the obj-representation
using memcpy.)

If no, then this behaviour is well-defined.

If yes, then this behaviour is undefined, although, in practice, both
gcc and clang (the compilers that can compile Linux kernels) must be
allowing the expected behaviour here, since the compiled kernel
binaries do indeed work wherever they are deployed. This only means
that, *if* the behaviour turns out to be theoretically undefined, *then*
the std. can be augmented to specifically allow reading of
representations of objects pointed to by such suitably converted pointers,
if it does not already allow such reads.

[ ... ]

> > A function that is passed 'struct string *' can still be allowed to
> > inplace-modify data[i] by invoking appropriate string APIs that temporarily
> > casts away the constness of data[i].
>
> That would be a violation of const correctness.  If necessary, yeah, go
> ahead.  But personally, I'd prefer avoiding that.  For something that is
> proposed as an inclusion to ISO C, I'd rather not have it.

I keep two bits of info within the string:
(1) whether the 'struct string' is heap-allocated, or not.
(2) whether the data (const char *) it points to is heap-allocated, or not.

The std. says:
"If an attempt is made to modify an object defined with a const-qualified
type through use of an lvalue with non-const-qualified type, the behavior
is undefined."

Given that a malloc'd object doesn't have its own declaration or a
declared type to begin with, it also doesn't have a definition, let alone
a const-qualified definition. Modifying malloc'd objects should not cause
const-correctness to be violated, even when casting away const on a
pointer that points to such an object.

Modifiable strings can be created from statically allocated data (for e.g.,
pointing inside the .rodata section), and the data can be malloc'd and
changed/appended-to on a copy-on-write basis.

-Amol

>
> I would prefer that a libstring adds those APIs, rather than having them
> in the standard.  They aren't uncontroversial.  Neither char* is
> uncontroversial, but at least it's simple.
>
> > The SEI CERT coding standard for C allows for an exception for such
> > modifications. See
> > https://wiki.sei.cmu.edu/confluence/display/c/EXP05-C.+Do+not+cast+away+a+const+qualification#:~:text=end%20is%20undefined%20*/-,EXP05%2DC%2DEX3,-%3A%20Because%20const
> >
> > Some other options:
> > Make use of the sparse semantic checker.
> >
> > It might help to name the fields with underscores to highlight the fact that
> > they are internal fields.  For e.g. 'str_size__' and 'str_data__'. Any use of
> > these fields outside of the dedicated API functions can now be easily
> > traced by simple grep searches.
> >
> > It might also help to create a visible type such as
> >
> > struct string {
> >     const uintptr_t raw[2];
> > };
> >
> > and let the API manage the translation between uintptr_t and size_t/char *.
>
> But you'll need a transparent union to do the magic.  All of that shows
> that it isn't an easy thing through structures.  unions at least have it
> easier.
>
> Cheers,
> Alex
>
> --
> <https://www.alejandro-colomar.es/>
> Looking for a remote C programming job at the moment.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Assignment of union containing const-qualifier member
  2024-02-12 10:45           ` Amol Surati
@ 2024-02-12 11:18             ` Amol Surati
  2024-03-18  9:19             ` Alejandro Colomar
  1 sibling, 0 replies; 12+ messages in thread
From: Amol Surati @ 2024-02-12 11:18 UTC (permalink / raw)
  To: Alejandro Colomar; +Cc: gcc-help

Typo in the sock_addr section.

Please read
"So typecasting between object types is well-defined, given that the"
as
"So typecasting between pointer types is well-defined, given that the"

> The std says:
> "A pointer to an object type may be converted to a pointer to a different
> object type. If the resulting pointer is not correctly aligned for the
> referenced type, the behavior is undefined. Otherwise, when converted
> back again, the result shall compare equal to the original pointer."
>
> So typecasting between object types is well-defined, given that the
> resulting pointer is indeed correctly aligned for the target type. The
> various sock_addr types all impose a common default alignment,
> because they all begin with a member of the same type.

-Amol

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Assignment of union containing const-qualifier member
  2024-02-12 10:45           ` Amol Surati
  2024-02-12 11:18             ` Amol Surati
@ 2024-03-18  9:19             ` Alejandro Colomar
  2024-03-18  9:23               ` Alejandro Colomar
  1 sibling, 1 reply; 12+ messages in thread
From: Alejandro Colomar @ 2024-03-18  9:19 UTC (permalink / raw)
  To: Amol Surati; +Cc: gcc-help

[-- Attachment #1: Type: text/plain, Size: 8728 bytes --]

Hi Amol,

On Mon, Feb 12, 2024 at 04:15:24PM +0530, Amol Surati wrote:
> Hi Alex,
> 
> On Wed, 7 Feb 2024 at 18:59, Alejandro Colomar <alx@kernel.org> wrote:
> >
> > Hi Amol,
> >
> > On Wed, Feb 07, 2024 at 09:43:55AM +0530, Amol Surati wrote:
> [ ... ]
> > The fact that the compiler does the equivalent of memcpy(3) doesn't
> > imply a removal of type safety.
> >
> > The copiler is aware of the union type at the assignment, and so can
> > emit diagnostics if anything is wrong.
> 
> True. That's the diagnostics that was emitted when running 'v=u'.
> 
> >
> > When assigning to a union member, only that member needs to be copied,
> > and I'm not sure what happens to other members that have a larger size.
> > As far as I can read in the standard, it's Undefined Behavior.  Maybe
> > someone can correct me.
> 
> The std says:
> "When a value is stored in a member of an object of union type, the bytes
> of the object representation that do not correspond to that member but
> do correspond to other members take unspecified values."

Hmm, interesting.

> > When assigning to a union (not its members), the standard doesn't seem
> > to specify at all the semantics, so I can't really tell.  Depending on
> > the semantics of assigning to a member, I can tell.  At first glance, I
> > don't see anything we should worry.
> 
> [ ... ]
> > Not really.  If well used, they do actually improve type safety.  Think
> > of my string unions vs the struct proposed by Martin.  My union has
> > const correctness, which is impossible to achieve via structs.  A
> 
> I am not sure I fully understand how using structs forbids achieving
> const-correctness (vs the unions as declared),


> but I am also not aware
> of all the ways in which nginx uses strings.
> 
> [ ... ]
> 
> > Indeed.  But I'm not sure I'd like a string type that I must treat as a
> > FILE*.  Also, how do you initialize such strings?  With an EMPTY_STRING
> > macro?  Do you always get a heap pointer, so you can't declare a string
> > in the stack as this?
> >
> >         struct string  s;
> >
> > What's the content of that string?  Is a NULL pointer a valid state of
> > the string?  In NGINX we found this week several cases where we were
> > doing pointer arithmetics on strings that were holding NULL, precisely
> > due to this (or it was after some calloc(3), more likely, I don't
> > remember).  It was NULL + 0, which is the least of undefined behaviors
> > I'm concerned about, but hey, there's another NULL to worry about.
> >
> > So, you'll need to do
> >
> >         struct string  s = EMPTY_STRING;
> >
> > to avoid ever having a NULL, which is yet another thing to worry about.
> >
> > Having suffered these strings, I wouldn't recommend them as something
> > better than a char*.  In NGINX, we have them due to historic reasons,
> > and rewriting the entire code base would be unreasonable, and while they
> > may have been a good optimization 2 decades ago, I bet that it's a
> > useless micro-optimization today, and we could just use char* everywhere
> > to have less problems.
> 
> If a variable is going to be overwritten without being read first, then there's
> no need to initialize it. Within my implementation, the state resulting from
> zero-init, i.e. 'struct string s = {0};', is deliberately chosen to be
> well-defined
> for the implementation. For e.g. appending a char to a zero-init'd string is
> well-defined.

Good.  Reading a zero-init'd string would return something like "", I
guess.  Probably needs some special-case in the code, but ok.

> > > > You can only make them const, if you use two distinct types: a read-only
> > > > version, let's call it rstring, and a read-write version, let's call it
> > > > string.
> > >
> > > But do we need both views (r/w and r/o) at the same time?
> >
> > Yes: I want to be able to take a r/w string and pass it to a function
> > that accepts a r/o string.  Since distinct types are incompatible, and
> > this is the main problem with things like sockaddr, you can't just cast;
> > that's likely to invoke Undefined Behavior.
> 
> The std says:
> "A pointer to an object type may be converted to a pointer to a different
> object type. If the resulting pointer is not correctly aligned for the
> referenced type, the behavior is undefined. Otherwise, when converted
> back again, the result shall compare equal to the original pointer."
> 
> So typecasting between object types is well-defined, given that the

Yes, casting pointers is fine.

> resulting pointer is indeed correctly aligned for the target type. The

But having the resulting pointer be correctly aligned is enough for
allowing a pointer conversion.  But the only thing that the standard
allows to do with that pointer is to convert it back.  It is not allowed
to dereference it, even if the alignment is correct.

<http://port70.net/~nsz/c/c11/n1570.html#6.3.2.3p7>

> various sock_addr types all impose a common default alignment,
> because they all begin with a member of the same type.
> 
> The std. also says:
> "An object shall have its stored value accessed only by an lvalue
> expression that has one of the following types:
>     . . .
>     — a character type."
> 
> For e.g., the _sys_bind system-call under Linux kernel receives the
> same 'const struct sockaddr *', along with the size of the data, from
> glibc. The kernel then copies the data using memcpy (or equivalent
> routines).
> 
> Does the pointer conversion forbid the kernel from memcpy'ing the
> object using that converted pointer? (It is likely that the converted
> pointer, if it is not of compatible type, may not be dereferenced using
> that type; but here the kernel isn't dereferencing the pointer using
> the 'const struct sockaddr *' type; it is reading the obj-representation
> using memcpy.)
> 
> If no, then this behaviour is well-defined.

Nah, memcpy'ing is fair game.

> If yes, then this behaviour is undefined, although, in practice, both
> gcc and clang (the compilers that can compile Linux kernels) must be
> allowing the expected behaviour here, since the compiled kernel
> binaries do indeed work wherever they are deployed. This only means
> that, *if* the behaviour turns out to be theoretically undefined, *then*
> the std. can be augmented to specifically allow reading of
> representations of objects pointed to by such suitably converted pointers,
> if it does not already allow such reads.

However, when the kernel writes data that the user should read, it does
so via memcpy on a sockaddr_storage.  How is the user supposed to read
that data?

Say we define a sockaddr_storage, pass it to getpeername(2), where the
kernel writes to it, and then we want to read it.  Which type should we
use?  Imagine that it's a Unix socket, so it should be as sockaddr_un.
But it was never written as that type, so we can't.

<https://austingroupbugs.net/view.php?id=1641>

> [ ... ]
> 
> > > A function that is passed 'struct string *' can still be allowed to
> > > inplace-modify data[i] by invoking appropriate string APIs that temporarily
> > > casts away the constness of data[i].
> >
> > That would be a violation of const correctness.  If necessary, yeah, go
> > ahead.  But personally, I'd prefer avoiding that.  For something that is
> > proposed as an inclusion to ISO C, I'd rather not have it.
> 
> I keep two bits of info within the string:
> (1) whether the 'struct string' is heap-allocated, or not.
> (2) whether the data (const char *) it points to is heap-allocated, or not.
> 
> The std. says:
> "If an attempt is made to modify an object defined with a const-qualified
> type through use of an lvalue with non-const-qualified type, the behavior
> is undefined."
> 
> Given that a malloc'd object doesn't have its own declaration or a
> declared type to begin with, it also doesn't have a definition, let alone
> a const-qualified definition. Modifying malloc'd objects should not cause
> const-correctness to be violated, even when casting away const on a
> pointer that points to such an object.
> 
> Modifiable strings can be created from statically allocated data (for e.g.,
> pointing inside the .rodata section), and the data can be malloc'd and
> changed/appended-to on a copy-on-write basis.

Hmmm, while casting away const is valid, it's not safe.  I'm not sure
I'd call that const-correct.  It's not UB, that I agree.  But you're
bypassing type safety when you do a cast.

Have a lovely day!
Alex

-- 
<https://www.alejandro-colomar.es/>
Looking for a remote C programming job at the moment.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Assignment of union containing const-qualifier member
  2024-03-18  9:19             ` Alejandro Colomar
@ 2024-03-18  9:23               ` Alejandro Colomar
  0 siblings, 0 replies; 12+ messages in thread
From: Alejandro Colomar @ 2024-03-18  9:23 UTC (permalink / raw)
  To: Amol Surati; +Cc: gcc-help

[-- Attachment #1: Type: text/plain, Size: 478 bytes --]

Hi Amol,

Let me ask these questions:

Why does ISO C need a string structure?
Why isn't it enough that some library implements such a string
structure?  At least for some decades, until time has proven that this
is the string we all want.  If it's a good type, and the library is
portable, there shouldn't be a real difference if it's standard or not.

Cheers,
Alex

-- 
<https://www.alejandro-colomar.es/>
Looking for a remote C programming job at the moment.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2024-03-18  9:23 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-01-30 21:45 Assignment of union containing const-qualifier member Alejandro Colomar
2024-01-31 18:14 ` Alejandro Colomar
2024-02-04  7:33   ` Amol Surati
2024-02-04  8:18     ` Amol Surati
2024-02-04 18:40     ` Alejandro Colomar
2024-02-04 20:37       ` Alejandro Colomar
2024-02-07  4:13       ` Amol Surati
2024-02-07 13:29         ` Alejandro Colomar
2024-02-12 10:45           ` Amol Surati
2024-02-12 11:18             ` Amol Surati
2024-03-18  9:19             ` Alejandro Colomar
2024-03-18  9:23               ` Alejandro Colomar

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).