From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) by sourceware.org (Postfix) with ESMTPS id D6F653858D3C for ; Thu, 16 Mar 2023 13:44:46 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org D6F653858D3C Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=redhat.com Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=redhat.com DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1678974286; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=IeM+SwvJRQGRyHP+qo8Lrrijijk9DMswlh0knSu9IyM=; b=Vwm5NlzJLRoUBCfelIGpWq5aUFJye+WgjT8qr38mvzL33jpo/0UJ32cQuDXNBbmxRUkPtd qxw8x56jC9xRe6MXovUsEnRKJZtMt3Zh4y3fM4UUpKdsI+aBnZ1IJLjvXSLUgrRMyFNzX7 2Udq+s6koEcsAQq3Hhs6nRJgDprNJFA= Received: from mail-qv1-f71.google.com (mail-qv1-f71.google.com [209.85.219.71]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-592-5fRh_UR5MJe31i0Oo7SosQ-1; Thu, 16 Mar 2023 09:44:45 -0400 X-MC-Unique: 5fRh_UR5MJe31i0Oo7SosQ-1 Received: by mail-qv1-f71.google.com with SMTP id e1-20020a0cd641000000b005b47df84f6eso1109597qvj.0 for ; Thu, 16 Mar 2023 06:44:45 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; t=1678974284; h=mime-version:user-agent:content-transfer-encoding:references :in-reply-to:date:to:from:subject:message-id:x-gm-message-state:from :to:cc:subject:date:message-id:reply-to; bh=IeM+SwvJRQGRyHP+qo8Lrrijijk9DMswlh0knSu9IyM=; b=hVieGKWQ+7l+vp3uG2/ZKPzPHiDXpAk3eNRMDOfgQTIfDm8JQABfx2qP6KeoSJ+zYL bGBfMLTHG8P0GNOrhpMEciTuFzINGEsdh0PgUewRWLDOdhs3L7dw8d6P34hnWxOCIT30 OFUhRh1qpae3orcyy4WuURYqtAEGzifAFYMLnH4PscAiI02sqfPeFrpvv5siho+qFKwa uAeAT/m6rt7ctKZ4Op+p9228iKcU8F0cjTI7aefUkcOsWbYd2Xc6FLXsBVIBw0P0snoO 7V6EgerSnR3vhPLtHLvUUJSN+g1bEvOhCq0wzFZfrUgPI3577dlDW4iEgbXBYJ9H7ZVc zuuA== X-Gm-Message-State: AO0yUKW0jkMViRvaydx5ELh6kpNLiyEiw6CeFfOY7FTDmT4V+18TuKZz KriW75PbaNtoHoLTiYLs1cJuT042MmyzOOjv5VeDCOXRHc5aguM9o3AQxpQ6DY5lqsxcCUKiFEG nV7sn0Eo= X-Received: by 2002:a05:6214:29c1:b0:56e:a9d4:429b with SMTP id gh1-20020a05621429c100b0056ea9d4429bmr31546912qvb.1.1678974284674; Thu, 16 Mar 2023 06:44:44 -0700 (PDT) X-Google-Smtp-Source: AK7set9BG4I/of3V8/IJOyOm1QkMLMXTeQ1SUbQ5wJAJwZjy9bGWTRpJPUka2liDdJlV+ykfZy852Q== X-Received: by 2002:a05:6214:29c1:b0:56e:a9d4:429b with SMTP id gh1-20020a05621429c100b0056ea9d4429bmr31546872qvb.1.1678974284280; Thu, 16 Mar 2023 06:44:44 -0700 (PDT) Received: from t14s.localdomain (c-73-69-212-193.hsd1.nh.comcast.net. [73.69.212.193]) by smtp.gmail.com with ESMTPSA id 74-20020a370a4d000000b007456b51ee13sm6021754qkk.16.2023.03.16.06.44.43 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 16 Mar 2023 06:44:43 -0700 (PDT) Message-ID: <39bb628307f65983b27d0d8724e9a9362c788774.camel@redhat.com> Subject: Re: Static Analyzer: correlate state for different region/svalue for reporting From: David Malcolm To: Pierrick Philippe , gcc@gcc.gnu.org Date: Thu, 16 Mar 2023 09:44:42 -0400 In-Reply-To: References: <98024bfba94a9dc306475fd2753bdcb55961dada.camel@redhat.com> User-Agent: Evolution 3.44.4 (3.44.4-1.fc36) MIME-Version: 1.0 X-Mimecast-Spam-Score: 0 X-Mimecast-Originator: redhat.com Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Spam-Status: No, score=-4.2 required=5.0 tests=BAYES_00,BODY_8BITS,DKIMWL_WL_HIGH,DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE,RCVD_IN_MSPIKE_H2,SPF_HELO_NONE,SPF_NONE,TXREP autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org List-Id: On Thu, 2023-03-16 at 09:54 +0100, Pierrick Philippe wrote: > On 15/03/2023 17:26, David Malcolm wrote: > > On Wed, 2023-03-15 at 16:24 +0100, Pierrick Philippe wrote: [...snip...] > >=20 > >=20 > > An ana::svalue is a pattern of bits (possibly symbolic, such as > > "the > > constant 42" or "the initial value of param x") > >=20 > > An ana::region is a description of a memory location for > > reads/writes > > (possibly symbolic, such as "local var x within this frame", or > > "the > > block reached by dereferencing the initial value of ptr in the > > frame > > above"). > >=20 > > Sorry if I've misread things; I'll try to answer the rest of the > > email > > as best as I can... > >=20 > > > So, first question: is there any way to associate and track the > > > state > > > of > > > a rvalue, independently of its lvalue? > > >=20 > > > To try to clarify the question, here's an example: > > >=20 > > > ''' > > > int __attribute__("__some_attribute__") x =3D 42; > > > /* STEP-1 > > > =C2=A0=C2=A0From now on, we consider the state of x as being marked b= y > > > some_attribute. > > > But in fact, in the log, we can observe that we'll have something > > > like > > > this in the new '/ana::program_state/': > > > {0x4b955b0: (int)42: marked_state (=E2=80=98x=E2=80=99)} */ > > Yes: sm-state is associated within a program_state with an svalue, > > in > > this case with the constant 42. > >=20 > > There isn't yet a way to instead give sm-state to "x" itself.=C2=A0 I > > suppose you could give sm-state to &x (the pointer that points to x > > is > > an instance of region_svalue, which is an svalue), but I haven't > > tried > > this approach myself. > >=20 > > > int *y =3D &x; > > > /* STEP-2 > > > For analysis purpose, you want to know that from now on, y is > > > pointing > > > to marked data. > > > So you set state of the LHS of the GIMPLE statement (i.e. some > > > ssa_name > > > instance of y) accordingly, with a state named 'points- > > > to_marked_data' > > > and setting 'x' as the origin of the state (in the sense of the > > > argument > > > /origin/ from '/ana::sm_context::on_transition/'. > > > What we now have in the new '/ana::program_state/' is this: > > > {0x4b9acb0: &x: points-to-marked_data (=E2=80=98&x=E2=80=99) (origin:= 0x4b955b0: > > > (int)42 > > > (=E2=80=98x=E2=80=99)), 0x4b955b0: (int)42: marked_state (=E2=80=98x= =E2=80=99)} */ > > Yes: you've set the state on the svalue "&x", not on "y". > >=20 > > > int z =3D *y; > > > /* STEP-3 > > > Now you have implicit copy of marked data, and you want to report > > > it. > > > So you state the LHS of the GIMPLE statement (i.e. some ssa_name > > > instance of z) as being marked, with 'y' as the origin. > > > What we now have in the new '/ana::program_state/' is this: > > > {0x4b9acb0: &x: points-to-marked_data (=E2=80=98&x=E2=80=99) (origin:= 0x4b955b0: > > > (int)42 > > > (=E2=80=98x=E2=80=99)), 0x4b955b0: (int)42: marked_state (=E2=80=98x= =E2=80=99)} */ > > > ''' > > Presumably the program_state also shows that you have a binding for > > the > > region "z", containing the svalue 42 (this is represented within > > the > > "store", within the region_model within the program_state). >=20 > Indeed, there is a binding for region "z" to the svalue 42 within the > new program state. >=20 > > > In STEP-2: > > >=20 > > > We lost the information saying that the rvalue of y (i.e. y), is > > > pointing to marked data. > > > Only remain the information that its lvalue (i.e. &x), is > > > pointing to > > > marked data, which is of course true. > > Note that the analyzer by default attempts to purge state that it > > doesn't think will be needed anymore, so it may have purged the > > state > > of "y" if y isn't going to be read from anymore.=C2=A0 You could try - > > fno- > > analyzer-state-purge to avoid this purging. >=20 > Nothing changes when I run it with the -fno-anlyzer-state-purge, > it is still the state of &x which is tracked. >=20 > > > In STEP-3: > > >=20 > > > No information is added regarding z in the new program_state. > > > In fact, if one have a closer look at the log, we see that the > > > LHS of > > > the GIMPLE statement (the ssa_name instance of z), is already in > > > the > > > state of 'marked_data'. > > > Through the log to the call to '/sm_context::on_transition/' this > > > can > > > be > > > seen: > > > =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 marked_sm: state transition of =E2=80= =98z_5=E2=80=99: marked_data -> > > > marked_data > > >=20 > > > All of this step is somehow leading to confusing diagnostic. > > > For example, you miss the fact that 'z' is being marked, because > > > no > > > event happen as it is somehow aliasing 'x' svalue. > > > Though it might not be true in case of missed optimization. > > >=20 > > > Of course, if anyone wants to have a look at my code, I can > > > provide > > > it > > > to you (it is not yet publicly available at the moment). > > I think I'd have to look at the code to be of more help; I confess > > that > > I stopped understanding somewhere around step 3, sorry. > No worries, the idea is that regarding the state_map, z_3 is already=20 > considered as marked. > > Are you able to post a simple example of the code you'd like to > > analyze, with an idea of the behavior you'd like from the analyzer? > > (e.g. the state transitions) >=20 > Sure here is an exemple: >=20 > ''' > int __attribute__("__some_attribute__") x =3D 42; > // Transition state for x: start -> marked > int *y =3D &x; > // Transition state for y: start -> points-to_marked > int z =3D *y; > // Transition state for z: start -> marked > if (z); > // Diagnostic is issued because of the use of a marked data in a=20 > condition, for analysis purpose. > ''' >=20 > The issued diagnostic I do have in mind looks like: >=20 > ''' > 1 | int __attribute__("__some_attribute__") x =3D 42; > =C2=A0=C2=A0=C2=A0 | ^ > =C2=A0=C2=A0=C2=A0 | | > =C2=A0=C2=A0=C2=A0 | (1) =E2=80=98x=E2=80=99 gets marked here > 2 | int *y =3D &x; > =C2=A0=C2=A0=C2=A0 | =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 ~ > =C2=A0=C2=A0=C2=A0 |=C2=A0 =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 | > =C2=A0=C2=A0=C2=A0 | =C2=A0=C2=A0=C2=A0=C2=A0 (2) =E2=80=98y=E2=80=99 poi= nts-to marked data here > 3 | int z =3D *y; > =C2=A0=C2=A0=C2=A0 |=C2=A0=C2=A0=C2=A0=C2=A0 ~ > =C2=A0=C2=A0=C2=A0 | =C2=A0=C2=A0=C2=A0=C2=A0 | > =C2=A0=C2=A0=C2=A0 |=C2=A0=C2=A0=C2=A0 (3) =E2=80=98z=E2=80=99 gets marke= d here > 4 | if (z); > =C2=A0=C2=A0=C2=A0 | =C2=A0 =C2=A0 ~ > =C2=A0=C2=A0=C2=A0 |=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 | > =C2=A0=C2=A0=C2=A0 |=C2=A0=C2=A0=C2=A0 (4) used of a marked value 'z' in = a condition > ''' >=20 > What I get from now on is this: >=20 > ''' > 1 |=C2=A0=C2=A0=C2=A0=C2=A0 int __attribute__((__taint__)) x =3D 42; > =C2=A0=C2=A0=C2=A0 | =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0=C2=A0 =C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 ^ > =C2=A0=C2=A0=C2=A0 |=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0=C2= =A0 =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 | > =C2=A0=C2=A0=C2=A0 | =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 (1) =E2=80=98x=E2=80=99 gets= =20 > tainted here > =C2=A0=C2=A0=C2=A0 |...... > 4 |=C2=A0=C2=A0=C2=A0=C2=A0 if(z) return 666; > =C2=A0=C2=A0=C2=A0 |=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 ~ > =C2=A0=C2=A0=C2=A0 |=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 | > =C2=A0=C2=A0=C2=A0 |=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 (2) use of taint= ed value =E2=80=98z=E2=80=99 in a condition > ''' If I'm reading things right, it looks like the analyzer is correctly complaining that the marked svalue is being used in a condition, but the problem is that the events in the diagnostic_path aren't very good: it's missing events (2) and (3) from the "what you had in mind" diagnostic. If I'm right, then the issue here seems to be more about how we generate the diagnostic_path (the list of event messages displayed for the diagnostic) when we emit the saved_diagnostic, rather than about the state-tracking. Unfortunately, when I wrote that code I was primarily looking at state machines for tracking allocation state (e.g. double-free), and it doesn't work well with other kinds of state machine. I've experimented with other approaches that might work better. I think ultimately we want to be able to walk backwards from a saved_diagnostic and be able to answer the question "why did it get into this state?", which might involve following assignments backwards. I have a semi-working prototype of this, but it would be a lot of work to get ready for trunk (it tries to record a set of reversible "change" objects, so that we can walk backwards from the diagnostic, undoing changes to see where a value came from). >=20 > And for some reason, if I change the analyzed code to this: >=20 > ''' > int __attribute__("__some_attribute__") x =3D 42; > int *y =3D &x; > x =3D 6; // New line > int z =3D *y; > if (z); > ''' >=20 > This is what I get: >=20 > ''' > 2 |=C2=A0=C2=A0=C2=A0=C2=A0 int __attribute__((__taint__)) x =3D 42; > =C2=A0=C2=A0=C2=A0 |=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0 ^ > =C2=A0=C2=A0=C2=A0 |=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0=C2=A0 =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 | > =C2=A0=C2=A0=C2=A0 |=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 =C2=A0 =C2=A0 =C2=A0= =C2=A0 =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 (1) =E2=80= =98x=E2=80=99 gets > marked=20 > here > 3 |=C2=A0=C2=A0=C2=A0=C2=A0 int * y =3D &x; > =C2=A0=C2=A0=C2=A0 |=C2=A0 =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0 ~ > =C2=A0=C2=A0=C2=A0 | =C2=A0 =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0 | > =C2=A0=C2=A0=C2=A0 | =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0 (2) =E2=80=98&x=E2=80=99 points to tainted data here > 4 |=C2=A0=C2=A0=C2=A0=C2=A0 x =3D 6; > 5 |=C2=A0=C2=A0=C2=A0=C2=A0 int z =3D *y; > =C2=A0=C2=A0=C2=A0 |=C2=A0=C2=A0=C2=A0=C2=A0 =C2=A0 =C2=A0=C2=A0 ~ > =C2=A0=C2=A0=C2=A0 |=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0 | > =C2=A0=C2=A0=C2=A0 |=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 (3) = =E2=80=98x=E2=80=99 gets tainted here > 6 |=C2=A0=C2=A0=C2=A0=C2=A0 if(z) return 666; > =C2=A0=C2=A0=C2=A0 |=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 ~ > =C2=A0=C2=A0=C2=A0 | =C2=A0 =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 | > =C2=A0=C2=A0=C2=A0 |=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 (4) use of taint= ed value =E2=80=98z=E2=80=99 in a condition That event (3) looks bogus; looks like a bug in how we generate events (the existing version of the code I was talking about above). > ''' >=20 > Sorry for the long mail here, No worries; sorry about the bugs in the analyzer :/ > Let me know if you want to have a look at the exploded graph or at > the=20 > analyzer's log. Does the above help? Dave