From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from sin.source.kernel.org (sin.source.kernel.org [IPv6:2604:1380:40e1:4800::1]) by sourceware.org (Postfix) with ESMTPS id 91DDE3858408 for ; Wed, 7 Feb 2024 13:29:22 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 91DDE3858408 Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=kernel.org Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=kernel.org ARC-Filter: OpenARC Filter v1.0.0 sourceware.org 91DDE3858408 Authentication-Results: server2.sourceware.org; arc=none smtp.remote-ip=2604:1380:40e1:4800::1 ARC-Seal: i=1; a=rsa-sha256; d=sourceware.org; s=key; t=1707312565; cv=none; b=qmHKWb6MxHtVE/vKbNqGSQL0Zmkv/ILXYclzkN5L+R0QBLe0ipFd4fdic36/uwyUR2SmDpj50zGP8O7IOcJdOrMX5kf28OwTJG/qXWvmKG/8IYBRI/UOTL6Adc2UKYQG8CELVh6wmE4dmm37ZKRvn40t6ptdVKo8tu0FN6ZxgKI= ARC-Message-Signature: i=1; a=rsa-sha256; d=sourceware.org; s=key; t=1707312565; c=relaxed/simple; bh=kum+Dg7ysFWDeJLT4xahPR+5YF58XrbjO5cIcDQL4GE=; h=DKIM-Signature:Date:From:To:Subject:Message-ID:MIME-Version; b=Anq112jqiqtIkqXNhEp7hFJyOuaNOJrRpeN7iLndnRz9MPh9Ez+LhjMiBPD5gDsceIVJt+8ZHO8kZIgTrAiy6VrDlLgO77ABW4pnamg+IC+Qkq9sz5FutxEdEB4It69SYSzBKecwg/IXufJCcFpjXrGUDH5Z1yNs19OYnLZ1A20= ARC-Authentication-Results: i=1; server2.sourceware.org Received: from smtp.kernel.org (transwarp.subspace.kernel.org [100.75.92.58]) by sin.source.kernel.org (Postfix) with ESMTP id E56E2CE18FA; Wed, 7 Feb 2024 13:29:17 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id 893D0C433F1; Wed, 7 Feb 2024 13:29:16 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1707312557; bh=kum+Dg7ysFWDeJLT4xahPR+5YF58XrbjO5cIcDQL4GE=; h=Date:From:To:Cc:Subject:References:In-Reply-To:From; b=G2L2PEOtws3a0kkDCBwBEb0+LkS5sBb1BRfpcqwQtC8qXQovRYkQfw/SaLyWIhKXD zCTrkxXL0ReT3WhKQ1VVeQLvE0I2rTR6aEcKNriX0/J9aWhlnOC4C+qDNMAi1dO554 WhVnLDX5qTfrkuh4iqyu1Dm8tOcoh+w4HwcYyrzI4riXxclxXiDf1t9+MWDVkjN4nF Ry65E2GrvrrVI6Uc0CmcGxgjeB5LOw0lhdwlGbGB+Nsep4IJ4yFzc4hxd/DFNnvaCD MTfaT85aby692bGmR1/7+cucTQqUrVrz6ZAtr6e9+amqf2GnlPDFpFFGwG1IZYRp6F j7KspG87T5y1Q== Date: Wed, 7 Feb 2024 14:29:08 +0100 From: Alejandro Colomar To: Amol Surati Cc: gcc-help@gcc.gnu.org Subject: Re: Assignment of union containing const-qualifier member Message-ID: References: MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha512; protocol="application/pgp-signature"; boundary="JJl885BjywXY1hu9" Content-Disposition: inline In-Reply-To: X-Spam-Status: No, score=-3.4 required=5.0 tests=BAYES_00,DKIMWL_WL_HIGH,DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,SPF_HELO_NONE,SPF_PASS,TXREP,T_SCC_BODY_TEXT_LINE,URI_DOTEDU autolearn=no autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org List-Id: --JJl885BjywXY1hu9 Content-Type: text/plain; protected-headers=v1; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: quoted-printable Date: Wed, 7 Feb 2024 14:29:08 +0100 From: Alejandro Colomar To: Amol Surati Cc: gcc-help@gcc.gnu.org Subject: Re: Assignment of union containing const-qualifier member Hi Amol, On Wed, Feb 07, 2024 at 09:43:55AM +0530, Amol Surati wrote: > > To have a specific proposal, I'll specify it as a diff of ISO C11: > > > > $ diff -u c11 suggestion > > --- c11 2024-02-04 19:37:27.520851005 +0100 > > +++ suggestion 2024-02-04 19:38:56.785402567 +0100 > > @@ -8,8 +8,8 @@ > > does not have array type, > > does not have an incomplete type, > > does not have a const- qualified type, > > -and if it is a structure or union, > > -does not have any member > > +and if it is a structure does not have any member, > > +or if it is a union does not have all members, >=20 > As mentioned in my other reply, the assignment from one object to > another (of the same union type that has a const member) will have > to copy the size of the largest member, regardless of which member > was active the last time the source object was written to. Such a > copy, I believe, is a sort of memcpy that the compiler inserts. And > as you had already noticed before, memcpy loses all type-safety. The fact that the compiler does the equivalent of memcpy(3) doesn't imply a removal of type safety. The copiler is aware of the union type at the assignment, and so can emit diagnostics if anything is wrong. When assigning to a union member, only that member needs to be copied, and I'm not sure what happens to other members that have a larger size. As far as I can read in the standard, it's Undefined Behavior. Maybe someone can correct me. When assigning to a union (not its members), the standard doesn't seem to specify at all the semantics, so I can't really tell. Depending on the semantics of assigning to a member, I can tell. At first glance, I don't see anything we should worry. > These might be the reasons the std flags an error when such union > objects are assigned to - assignment implies type-erasing memcpy. >=20 > > (including, recursively, > > any member or element of all contained aggregates or unions) > > with a const- qualified type. > > > > (Modifying ) >=20 > [ ... ] >=20 > > Modifying a union via a non-const member is fine in C, I believe. I >=20 > That's true, but as you noted in your other reply, unions allow diluting > the type-safety to a great extent. Not really. If well used, they do actually improve type safety. Think of my string unions vs the struct proposed by Martin. My union has const correctness, which is impossible to achieve via structs. A similar problem happens with struct sockaddr, which for historic reasons is a series of unrelated structures. If they had been designed as a hierarchic union, it wouldn't be so problematic, and you wouldn't need to cast everywhere to use them. >=20 > Linux kernel has a piece of code that does modify such a union > although it possesses decorations for the sparse semantic checker. > In its essence, it is similar to: >=20 > union { > const int flags; > int __attribute__ ((noderef)) __flags; > }; > See https://sparse.docs.kernel.org/en/latest/annotations.html#noderef >=20 > [ ... ] >=20 > > > I think it is better to have a 'class' and associated APIs. > > > > But we can't have that in C. >=20 > By 'class', I meant a rigid way of accessing the struct strictly through = the > provided APIs. That's precisely what a union offers. You can access it via a union member, which then restricts you to use the functions that accept that type, or you can keep the entire union for unlimited use. >=20 > [ ... ] >=20 > > The main concern I have with that proposal is the same concern I've had > > with strings in Nginx so far: you can't really make them `const`. > > Unless you make the type opaque, and only provide accessors via > > functions that protect the strings even if they could modify them. >=20 > The type need not be opaque, to allow for optimizations. But the users of > the type must behave as if the type were opaque. The bulk of the API can > be 'static inline' functions within the header, to allow for optimization= s and > inlining within a single translation unit. There's also LTO and thin-LTOs= to > carry out optimizations across translation units, if needed. Indeed. But I'm not sure I'd like a string type that I must treat as a FILE*. Also, how do you initialize such strings? With an EMPTY_STRING macro? Do you always get a heap pointer, so you can't declare a string in the stack as this? struct string s; What's the content of that string? Is a NULL pointer a valid state of the string? In NGINX we found this week several cases where we were doing pointer arithmetics on strings that were holding NULL, precisely due to this (or it was after some calloc(3), more likely, I don't remember). It was NULL + 0, which is the least of undefined behaviors I'm concerned about, but hey, there's another NULL to worry about. So, you'll need to do struct string s =3D EMPTY_STRING; to avoid ever having a NULL, which is yet another thing to worry about. Having suffered these strings, I wouldn't recommend them as something better than a char*. In NGINX, we have them due to historic reasons, and rewriting the entire code base would be unreasonable, and while they may have been a good optimization 2 decades ago, I bet that it's a useless micro-optimization today, and we could just use char* everywhere to have less problems. > > You can only make them const, if you use two distinct types: a read-only > > version, let's call it rstring, and a read-write version, let's call it > > string. >=20 > But do we need both views (r/w and r/o) at the same time? Yes: I want to be able to take a r/w string and pass it to a function that accepts a r/o string. Since distinct types are incompatible, and this is the main problem with things like sockaddr, you can't just cast; that's likely to invoke Undefined Behavior. It's the same as I can pass a char* to a function that accepts a const char*, without doing a cast. >=20 > > > > struct rstring_s { > > size_t length; > > const char *start; > > }; > > > > > > union nxt_string_u { > > struct { > > size_t length; > > char *start; > > }; > > struct { > > size_t length; > > char *start; > > } w; > > const rstring_t r; > > }; > > > > In Nginx we have another complexity: we don't necessarily terminate our > > strings: this allows getting a substring in the middle of another string > > without needing to make an actual copy of the memory. But then it means > > we need more types to have type safety. I haven't finished developing > > that, so I can't tell you if the code below does work, but this is what > > I'm really working with at the moment: > > > > struct nxt_rstr_s { > > size_t length; > > const u_char *start; > > }; > > > > > > union nxt_str_u { > > struct { > > size_t length; > > u_char *start; > > }; > > struct { > > size_t length; > > u_char *start; > > } w; > > const nxt_rstr_t r; > > }; > > > > > > union nxt_rstrz_u { > > struct { > > size_t length; > > union { > > const u_char *start; > > const char *cstrz; > > }; > > }; > > struct { > > size_t length; > > const u_char *start; > > } w; > > const nxt_rstr_t r; > > }; > > > > > > union nxt_strz_u { > > struct { > > size_t length; > > union { > > u_char *start; > > char *cstrz; > > }; > > }; > > struct { > > size_t length; > > u_char *start; > > } w; > > const nxt_rstr_t r; > > const nxt_rstrz_t rz; > > }; > > > > Structures `***z` contain null-terminated strings, while the other ones > > don't. You can read terminated strings as non-terminated ones, but not > > the other way. And you can access writable strings as read-only > > strings, but not the other way around. > > > > (We use `u_char` to avoid the problems that `char` has due to its > > ambiguous sign; I would personally prefer using -funsigned-char, but > > that's what it is, for historic reasons.) > > Anyway, that `u_char` makes sure we don't mix our strings with libc > > calls accidentally, and I only provide the `cstrz` member in unions that > > actually provide a libc-compatible string view. >=20 > Passing these structs+unions is still passing the string as r/w. It does > not seem too far away from passing non-const pointers and sizes. The thing is, I don't always pass the union. If I want to write the string in a function, I pass the union. If I want it r/o, I pass the const member. > In a sort of a project of mine, I had to implement a buffer that can be u= sed > to store ref-counted (non-terminated) strings and binary data. I did init= ially > create two different types, buf_rw and buf_ro, but then scrapped that for > what I have now, a skeleton of which is shown below: >=20 > struct string { > size_t size; > const char *data; > }; >=20 > data[i] is now marked as non-modifiable. >=20 > Passing 'const struct string *' to a function implies that the function c= annot > modify (in addition to data[i]), the fields size and data, through the > passed ptr. >=20 > A function that is passed 'struct string *' can still be allowed to > inplace-modify data[i] by invoking appropriate string APIs that temporari= ly > casts away the constness of data[i]. That would be a violation of const correctness. If necessary, yeah, go ahead. But personally, I'd prefer avoiding that. For something that is proposed as an inclusion to ISO C, I'd rather not have it. I would prefer that a libstring adds those APIs, rather than having them in the standard. They aren't uncontroversial. Neither char* is uncontroversial, but at least it's simple. > The SEI CERT coding standard for C allows for an exception for such > modifications. See > https://wiki.sei.cmu.edu/confluence/display/c/EXP05-C.+Do+not+cast+away+a= +const+qualification#:~:text=3Dend%20is%20undefined%20*/-,EXP05%2DC%2DEX3,-= %3A%20Because%20const >=20 > Some other options: > Make use of the sparse semantic checker. >=20 > It might help to name the fields with underscores to highlight the fact t= hat > they are internal fields. For e.g. 'str_size__' and 'str_data__'. Any us= e of > these fields outside of the dedicated API functions can now be easily > traced by simple grep searches. >=20 > It might also help to create a visible type such as >=20 > struct string { > const uintptr_t raw[2]; > }; >=20 > and let the API manage the translation between uintptr_t and size_t/char = *. But you'll need a transparent union to do the magic. All of that shows that it isn't an easy thing through structures. unions at least have it easier. Cheers, Alex --=20 Looking for a remote C programming job at the moment. --JJl885BjywXY1hu9 Content-Type: application/pgp-signature; name="signature.asc" -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE6jqH8KTroDDkXfJAnowa+77/2zIFAmXDhaQACgkQnowa+77/ 2zJXDw//RteVUsiYte8LoKn6QEIGHE2e7nH6ssRWE+9v2rHVVj9Jqnira7bT3rXg jyyNtDlC7hMHlIYCjIccHxPYpDfCAX1gqjYV2tm3OShQBthlVlWhRJ9v/weSZDoN BSILb1bv8CRJzQRCwhSmYSWhLcjIeuQ2FssiWOlMKAs31uJwEtUOBBYf+tRdpPvL GZSLKbPOMSUnhtXnBkEeBGDqOnIhiC+CbT5H1tEy3hK9Kf5iTNK9GZ7fi5C1k4LF ltqk0mvj6py/pGZoHmPe5C/nwV3aydDc0J8w4iFwUG45n35K8zp5GI1SkZ4XGDNq EzsfHKUeS0XdEqhYMA+PjsHy1dDp/U6y9KHUXdw10jSiGhAF3nhRGm/DVmK0e0QL b7KL24uoAca8JGEuhKtV1slgMy+kaQn9iSDbSITEOtWHzAkizicdm1RN8UDc5xaR n8IfPKz3nl3Mb2yC/u4NJiip+eYHTvdBipMiFsa9S6Mc9BGTOQfczNhBWSwmiRQ/ MtFBh0+Vwm62YTdkd2zij5/gY8y2s0bsbT7yPecZXeqo3veEG4AHK4G2oiyi8TKY nq/BtMsGWCNi5GvRu0C2M1tnNO1ujbxvsRK8nq7n6XPsioXFZZ9wk9HPG1GOHPNS 0tNxy6ZdiI34ZH2oHFuqL699aQqBlZgsGWc6iCNoZfFAV/6NkM4= =f08P -----END PGP SIGNATURE----- --JJl885BjywXY1hu9--