From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-bugzilla@gcc.gnu.org>
Received: by sourceware.org (Postfix, from userid 48)
	id 2D5C33858C66; Mon,  6 Mar 2023 19:15:02 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 2D5C33858C66
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gcc.gnu.org;
	s=default; t=1678130102;
	bh=qmKp32gNB2/Uc/4iZiPXX9WZja8ImPaXy8gq1LG1Dco=;
	h=From:To:Subject:Date:In-Reply-To:References:From;
	b=MrVSH3+ll78Z2HwJF3tucVsznfKiDRnlBybE9sAKn4BSEw2EjzUWX+n6KVMhrhlJ1
	 6/SM2tYDtDvCXxvK6OqVHNc/AQM2KXrikOHEcK6+ZRrqC2j0ayArorE57CMc8ZeA6z
	 uWcMO+TyPgSONtpTCq64ddfh9K0pLEf2NfEaJ8lo=
From: "isanbard at gmail dot com" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug c/108896] provide "element_count" attribute to give more
 context to __builtin_dynamic_object_size() and -fsanitize=bounds
Date: Mon, 06 Mar 2023 19:15:00 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: c
X-Bugzilla-Version: unknown
X-Bugzilla-Keywords: 
X-Bugzilla-Severity: enhancement
X-Bugzilla-Who: isanbard at gmail dot com
X-Bugzilla-Status: UNCONFIRMED
X-Bugzilla-Resolution: 
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: qinzhao at gcc dot gnu.org
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Flags: 
X-Bugzilla-Changed-Fields: 
Message-ID: <bug-108896-4-wH59mj1AWR@http.gcc.gnu.org/bugzilla/>
In-Reply-To: <bug-108896-4@http.gcc.gnu.org/bugzilla/>
References: <bug-108896-4@http.gcc.gnu.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
List-Id: <gcc-bugs.sourceware.org>

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=3D108896
--- Comment #18 from Bill Wendling <isanbard at gmail dot com> ---
(In reply to Martin Uecker from comment #17)
> Am Freitag, dem 03.03.2023 um 23:18 +0000 schrieb isanbard at gmail dot c=
om:
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=3D108896
> >=20
> > --- Comment #16 from Bill Wendling <isanbard at gmail dot com> ---
> > (In reply to Martin Uecker from comment #15)
> > > Am Freitag, dem 03.03.2023 um 20:27 +0000 schrieb isanbard at gmail d=
ot com:
> > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=3D108896
> > > >=20
> > > > --- Comment #14 from Bill Wendling <isanbard at gmail dot com> ---
> > > > (In reply to Martin Uecker from comment #9)
> > > > > > > Considering that the GNU extensions is rarely used, one could=
 consider
> > > > > > > redefining the meaning of
> > > > > > >=20
> > > > > > > int n =3D 1;
> > > > > > > struct {
> > > > > > > =C2=A0=C2=A0int n;
> > > > > > > =C2=A0=C2=A0char buf[n];
> > > > > > > };
> > > > > > >=20
> > > > > > > so that the 'n' refers to the member. Or we add a new syntax =
similar to
> > > > > > > designators (which intuitively makes sense to me).
> > > > > > designator might be better IMO.
> > > > > >=20
> > > > > > a question here is:
> > > > > >=20
> > > > > > for the following nested structure:=20
> > > > > >=20
> > > > > > struct object {
> > > > > > =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0...
> > > > > > =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0char items;
> > > > > > =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0...
> > > > > > =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0struct inner {
> > > > > > =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=
=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0...
> > > > > > =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=
=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0int flex[];
> > > > > > =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0};
> > > > > > } *ptr;
> > > > > >=20
> > > > > > what kind of syntax is good to represent the upper bound of "fl=
ex" in the inner
> > > > > > struct with "items" in the outer structure? any suggestion?
> > > > >=20
> > > > > I would disallow it. At least at first. It also raises some
> > > > > questions: For example, one could form a pointer to the inner
> > > > > struct, and then it is not clear how 'items' could be accessed
> > > > > anymore.
> > > > >=20
> > > >=20
> > > > That would be limiting its use in the Linux kernel. It seems that t=
here are
> > > > ways to refer to struct members already using something like "offse=
tof":
> > > >=20
> > > > struct object {
> > > > =C2=A0=C2=A0=C2=A0=C2=A0...
> > > > =C2=A0=C2=A0=C2=A0=C2=A0char items;
> > > > =C2=A0=C2=A0=C2=A0=C2=A0...
> > > > =C2=A0=C2=A0=C2=A0=C2=A0struct inner {
> > > > =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0...
> > > > =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0int flex[] __attrib=
ute__((__element_count__(offsetof(struct object,
> > > > items))));
> > > > =C2=A0=C2=A0=C2=A0=C2=A0};
> > > > } *ptr;
> > >=20
> > > This seems to be something completely different. offsetof
> > > computes the offset from the type given in its argument.
> > > But it would not access the value of the member of the
> > > enclosing struct. But it would not work in your example,
> > > because the struct object is incomplete at this point.
> > >=20
> > > So no, you can not use offsetof to reference a member
> > > of an enclosing struct.
> > >=20
> > "offsetof(struct foo, count)" is a fancy wrapper for "((struct foo
> > *)0)->count", which is resolved during sema, where it does have the full
> > structure definition. For instance, this compiles in C++:
> >=20
> > struct a {
> > =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0int count;
> > =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0int y =3D ((struct a *)=
0)->count;
> > } x;
> >=20
> > void foo(struct a *);
>=20
> This is not the case in C: https://godbolt.org/z/P7M6cdnoa
>=20
Right, it's not legal C syntax. I wanted to point out that it's not an
impossible syntax. The lexer and parser have no idea of, nor should they ca=
re
about, legal types, complete struct definitions, etc. All it sees is a memb=
er
access. It's semantic analysis that actually determines whether or not it's
correct, and if so what it means.

> But even we make it resolve the name, which we
> have to do for all proposals, it is something  different.
>=20
> If offsetof it would resolve the count of a different
> struct of the same *type* (here a non-existent one at=C2=A0
> address zero). Here we need a self reference to the
> same *object*.
>=20
If we make it an attribute, we have more control over what the arguments me=
an,
what they refer to, etc.

For the record, I'm not opposed to the idea of using the designated initial=
izer
syntax. I just think it's good to explore other ideas before settling on
extending C (which is a bit heavy-handed).

> > > But I am not saying we shouldn't have the attribute first.
> > >=20
> > I personally prefer using an attribute than the suggestion to use some =
other
> > syntax, like:
> >=20
> > struct foo {
> > =C2=A0=C2=A0=C2=A0=C2=A0int fam[.count];
> > };
> >=20
> > It becomes less intuitive what's going on here. And might conflict with=
 VLA's
> > in structures.
>=20
> The syntax with the dot would make it not conflict.  But I need
> this for this use case
>=20
> struct foo {
>   int count;
>   int (*buf)[.count];
> };
>=20
> so that ARRAY_SIZE(*foo->buf) works correctly and also accesses
> to foo->buf are bounds checkked.  So it would make sense to=C2=A0
> solve to treat flexible array members in the same way.
>=20
> But I agree that we should simply add the attribute now also
> because it makes it possible to use it for existing code bases.
>=20
Agreed. I believe we're on the same page (despite what I may write here :-).

> > > > It also has the benefit of
> > > > allowing one to reference a variable not in the structure:
> > > >=20
> > > > const int items;
> > > > struct object {
> > > > =C2=A0=C2=A0=C2=A0=C2=A0...
> > > > =C2=A0=C2=A0=C2=A0=C2=A0char items;
> > > > =C2=A0=C2=A0=C2=A0=C2=A0...
> > > > =C2=A0=C2=A0=C2=A0=C2=A0struct inner {
> > > > =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0...
> > > > =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0int flex[] __attrib=
ute__((__element_count__(items))); /* References
> > > > global "items" */
> > > > =C2=A0=C2=A0=C2=A0=C2=A0};
> > > > } *ptr;
> > >=20
> > > Whether you allow this or not has nothing to do with the syntax.
> > >=20
> > > The question is what semantics you attach to this and this is
> > > also a question in your example.=20
> > >=20
> > > If you define
> > >=20
> > > struct inner* a =3D ...
> > >=20
> > > What does it say for  a->flex  ?
> > >=20
> > I need to point out that I used "offsetof" only as an example. It's pos=
sible to
> > create something more robust which will carry along type information, e=
tc.,
> > which the current offsetof macro throws away. I should have made that c=
lear.
> >=20
> > The sanitizers that see "a->flex" will try to find the correct variable=
. If
> > they can't, then they won't generate a check. In the case of it referen=
cing a
> > non-field member, it'll use that if it's within scope. If it refers to =
a field
> > member of a parent container that's not within scope, it'll also not ge=
nerate a
> > check. It's unfortunate that these checks are done as a "best effort," =
but it
> > could lead to software changes to improve security checks (like passing=
 a
> > parent structure into a function rather than an inner structure.
>=20
> Yes. We could also have an optional warning which warns about accessing
> 'flex' in a context where 'items' is not accessible.  My point is that
> this feature of potentially referring to stuff which may not be accessible
> in all cases makes implementation more complicated.
>=20
It does make it more complicated, that's true.

I haven't seen comments on Kees's first example, where "malloc" returns an
"__alloc_size" hint that's lost when "p" is returned from the function (at
least in Clang). If that information is kept around, it would expand the nu=
mber
of bounds checks we could perform. (Kudos if this is currently the case in
GCC.)

-bw=