Date: Sun, 4 Feb 2024 19:40:23 +0100
From: Alejandro Colomar
To: Amol Surati
Cc: gcc-help@gcc.gnu.org
Subject: Re: Assignment of union containing const-qualifier member

Hi Amol,

On Sun, Feb 04, 2024 at 01:03:48PM +0530, Amol Surati wrote:
> On Wed, 31 Jan 2024 at 23:46, Alejandro Colomar via Gcc-help wrote:
> >
> > On Tue, Jan 30, 2024 at 10:45:11PM +0100, Alejandro Colomar wrote:
> > > Hi,
> > >
>
> [ ... ]
>
> > structure, that doesn't help. memcpy(3) does help, but it loses all
> > type safety.
> >
> > Maybe this could be allowed as an extension. Any thoughts?
> >
>
> Does it make sense to propose that, if the first top-level member of a
> union is completely (i.e. recursively) writable, then a non-const union
> object as a whole is writable? If so, then, for union objects a and b of
> a union that has such const members, a = b can be expected to not
> raise errors about const-correctness.

To have a specific proposal, I'll specify it as a diff of ISO C11:

    $ diff -u c11 suggestion
    --- c11 2024-02-04 19:37:27.520851005 +0100
    +++ suggestion 2024-02-04 19:38:56.785402567 +0100
    @@ -8,8 +8,8 @@
     does not have array type,
     does not have an incomplete type,
     does not have a const- qualified type,
    -and if it is a structure or union,
    -does not have any member
    +and if it is a structure does not have any member,
    +or if it is a union does not have all members,
     (including, recursively, any member or element of all contained aggregates or unions)
     with a const- qualified type.
     (Modifying )

> It seems that a union only provides a view of the object. The union
> object doesn't automatically become const qualified if a member
> of the union is const-qualified. This seems to be the reason v.w = u.w
> works; otherwise, that modification can also be viewed as the
> modification of an object (v.r) defined with a const-qualified type through
> the use of an lvalue (v.w) with non-const-qualified type - something that's
> forbidden by the std.

Modifying a union via a non-const member is fine in C, I believe. I
think you're creating a new object, and discarding the old one, so you
don't need to care if there was an old object defined via a
const-qualified type. That is, the following code is valid C, AFAIK:

    alx@debian:~/tmp$ cat u.c
    union u {
        int a;
        const int b;
    };

    int main(void)
    {
        union u u = {.b = 42};

        u.a = 7;
        return u.b;
    }
    alx@debian:~/tmp$ gcc-14 -Wall -Wextra u.c
    alx@debian:~/tmp$ ./a.out ; echo $?
    7
    alx@debian:~/tmp$ clang-17 -Weverything u.c
    alx@debian:~/tmp$ ./a.out ; echo $?
    7

> More towards the use of the string as described:
> If there are multiple such union objects that point to the same string,
> and if a piece of code decides to modify the string, other consumers of
> this string remain unaware of the modification, unless they check for it,
> for e.g., by keeping a copy, calc. hash, etc., to ensure that the string was
> indeed not silently modified behind their backs.

`const` only guarantees that an object is not modified through that
pointer. As long as you keep another pointer to the same object, it can
be modified via that other pointer. To guarantee that an object is
really constant (at least as far as a function is concerned), you also
need to specify `restrict`. If you have a `const T *restrict` and the
function reads the object through it, then modifying that object by any
means while the function runs is undefined behavior, so the object is
constant as far as the current function is concerned. Note that this is
a promise made by the programmer; the compiler doesn't check it.

If you're worried about multi-threaded programs, well, unions aren't any
more problematic here than passing a `const T *restrict` to a function,
and modifying it in another thread via a non-const lvalue. As long as
the original object wasn't const, that's fair game. It's the
programmer's task to make sure the functions behave well if that can
happen.

>
> I think it is better to have a 'class' and associated APIs.

But we can't have that in C.

> See [1], for e.g., or the implementation of c++ std::string.
>
> The ownership of an object of such a class can be passed by passing
> a non-const pointer to the object.
>
> Functions that are not supposed to own the object can be passed a
> const pointer. Despite that, if such functions need to modify it for local
> needs, they can create a copy to work with.
>
> One can additionally maintain a ref-count on the char pointer, to avoid
> having to unnecessarily copy a string if it is going to be placed in several
> stay-resident-after-return data-structures.

I normally prefer simple C strings, with a simple pointer. The reason
I'm using this struct+union is performance. In nginx, we use these
structures to reduce memory consumption (you can get a substring by
copying a pointer and specifying a length), and also to avoid
calculating the length of a string more than once.
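
To make the substring point concrete, here is a minimal sketch of such a
length+pointer view. The names (`struct view`, `view_from_cstr`,
`view_sub`, `view_eq_cstr`) are made up for this mail; they are not the
real nginx types or API:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical view type for illustration; not the actual nginx type. */
struct view {
    size_t      length;
    const char  *start;
};

/* Wrap a C string, computing its length exactly once. */
static struct view view_from_cstr(const char *s)
{
    return (struct view){ .length = strlen(s), .start = s };
}

/* Substring [off, off + len) of v: copies a pointer, not the bytes.
 * The result is generally NOT null-terminated. */
static struct view view_sub(struct view v, size_t off, size_t len)
{
    assert(off <= v.length && len <= v.length - off);
    return (struct view){ .length = len, .start = v.start + off };
}

/* Compare a view against a C string without requiring termination. */
static int view_eq_cstr(struct view v, const char *s)
{
    return strlen(s) == v.length && memcmp(v.start, s, v.length) == 0;
}
```

With this, `view_sub(view_from_cstr("GET /index.html"), 4, 11)` views
`/index.html` inside the original buffer; nothing is copied and no
length is recomputed.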

So far, we were using a simple struct:

    typedef struct string {
        size_t  length;
        u_char  *start;
    } string_t;  // it has a different name, but let's keep it simple

But that means we basically can't use `const` at all with our strings.
Because if you specify

    void foo(const string_t *str);

that only means that foo() can't modify the `start` pointer itself; it
can still modify the bytes it points to. So you can't guarantee that a
string isn't corrupted after some call, unless you inspect all the code
that the function calls, recursively.

I started working on a way to improve these strings around a year ago,
and have recently come up with something.

>
> -Amol
>
> [1] https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3210.pdf

Maybe you can get something from what I've learnt with strings in Nginx,
since they're quite close to what that proposal has. The main concern I
have with that proposal is the same concern I've had with strings in
Nginx so far: you can't really make them `const`, unless you make the
type opaque and only provide access through functions that leave the
strings alone even though they could modify them.

You can only make them const if you use two distinct types: a read-only
version, let's call it rstring, and a read-write version, let's call it
string.

    typedef struct rstring_s  rstring_t;  // typedef needed by the union below

    struct rstring_s {
        size_t      length;
        const char  *start;
    };

    union nxt_string_u {
        struct {
            size_t  length;
            char    *start;
        };
        struct {
            size_t  length;
            char    *start;
        } w;
        const rstring_t  r;
    };

In Nginx we have another complexity: we don't necessarily terminate our
strings. This allows getting a substring in the middle of another string
without needing to make an actual copy of the memory. But it also means
we need more types to have type safety.
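
As a rough sketch of how the two-type idea protects callers (simplified:
no anonymous struct, and `count_char` and `to_upper_ascii` are
hypothetical helpers, not nginx functions). It assumes both structs have
the same layout, so reading `.r` after writing `.w` sees the same
pointer and length:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Read-only view: the characters cannot be modified through it. */
typedef struct rstring_s {
    size_t      length;
    const char  *start;
} rstring_t;

/* Writable string; .r views the same bytes as read-only.
 * Assumption: both structs share one layout (`char *` and
 * `const char *` have the same representation). */
typedef union string_u {
    struct {
        size_t  length;
        char    *start;
    } w;
    const rstring_t  r;
} string_t;

/* Hypothetical reader: it can only read the characters. */
static size_t count_char(const rstring_t *s, char c)
{
    size_t n = 0;

    for (size_t i = 0; i < s->length; i++)
        n += (s->start[i] == c);
    /* Writing s->start[0] here would not compile: it is a const char. */
    return n;
}

/* Hypothetical writer: it needs the writable type. */
static void to_upper_ascii(string_t *s)
{
    for (size_t i = 0; i < s->w.length; i++) {
        if (s->w.start[i] >= 'a' && s->w.start[i] <= 'z')
            s->w.start[i] -= 'a' - 'A';
    }
}
```

A function taking `const rstring_t *` can then be called as
`count_char(&s.r, 'o')` on a writable `string_t s`, and the compiler
rejects any attempt to write through `s->start` inside it.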

I haven't finished developing those types yet, so I can't tell you if
the code below works, but this is what I'm actually working with at the
moment:

    // typedefs needed by the unions below
    typedef struct nxt_rstr_s  nxt_rstr_t;
    typedef union nxt_rstrz_u  nxt_rstrz_t;

    struct nxt_rstr_s {
        size_t        length;
        const u_char  *start;
    };

    union nxt_str_u {
        struct {
            size_t  length;
            u_char  *start;
        };
        struct {
            size_t  length;
            u_char  *start;
        } w;
        const nxt_rstr_t  r;
    };

    union nxt_rstrz_u {
        struct {
            size_t  length;
            union {
                const u_char  *start;
                const char    *cstrz;
            };
        };
        struct {
            size_t        length;
            const u_char  *start;
        } w;
        const nxt_rstr_t  r;
    };

    union nxt_strz_u {
        struct {
            size_t  length;
            union {
                u_char  *start;
                char    *cstrz;
            };
        };
        struct {
            size_t  length;
            u_char  *start;
        } w;
        const nxt_rstr_t   r;
        const nxt_rstrz_t  rz;
    };

The `*z` types contain null-terminated strings, while the other ones
don't. You can read terminated strings as non-terminated ones, but not
the other way around. And you can access writable strings as read-only
strings, but not the other way around.

(We use `u_char` to avoid the problems that `char` has due to its
implementation-defined signedness; I would personally prefer using
-funsigned-char, but that's what it is, for historic reasons.) Anyway,
that `u_char` makes sure we don't accidentally mix our strings with libc
calls, and I only provide the `cstrz` member in unions that actually
provide a libc-compatible string view.

Have a lovely day,
Alex

-- 
Looking for a remote C programming job at the moment.