public inbox for gcc@gcc.gnu.org
 help / color / mirror / Atom feed
From: Pierrick Philippe <pierrick.philippe@irisa.fr>
To: David Malcolm <dmalcolm@redhat.com>, gcc@gcc.gnu.org
Subject: Re: Static Analyzer: correlate state for different region/svalue for reporting
Date: Thu, 16 Mar 2023 09:54:09 +0100	[thread overview]
Message-ID: <ff85e786-4401-5735-9f4d-8b1f7bd3601a@irisa.fr> (raw)
In-Reply-To: <98024bfba94a9dc306475fd2753bdcb55961dada.camel@redhat.com>

On 15/03/2023 17:26, David Malcolm wrote:
> On Wed, 2023-03-15 at 16:24 +0100, Pierrick Philippe wrote:
>> Hi everyone,
> Hi Pierrick
>
>> I have some question regarding the analyzer.
>> I'm currently working on an fully out-of-tree static analyzer plugin.
>> I started development on commit tagged as /basepoints/gcc-13/, but
>> recently moved my code to a more recent /trunk/ branch (build
>> /20230309/).
>>
>>   From my different experiments and analyzer's log readings, I have
>> the
>> intuition that '/ana::state_machine::state/' class is mostly (if not
>> only) linked with '/ana::svalue/',
> Yes: a program_state (which describes the state at a particular node
> along an execution path) contains (among other things) an sm_state_map
> per state_machine, and the sm_state_map maps from svalues to
> state_machine::states.
>
>> i.e. the lvalue of a memory region.
>> In my analysis case, I would like to also be able to track state for
>> some rvalue of a memory region, i.e. '/ana::region/' objects.
> You're using the terms "lvalue" and "rvalue", but if I'm reading your
> email right, you're using them in exactly the opposite way that I
> understand them.
I apologize, you are completely right, I swapped their usage...
>
> I think of an "lvalue" as something that can be on the left-hand side
> of an assignment: something that can be written to (an ana::region),
> whereas an "rvalue" is something that can be on the right-hand side of
> an assignment: a pattern of bits (an ana:svalue) that be written into
> an lvalue.
>
> An ana::svalue is a pattern of bits (possibly symbolic, such as "the
> constant 42" or "the initial value of param x")
>
> An ana::region is a description of a memory location for reads/writes
> (possibly symbolic, such as "local var x within this frame", or "the
> block reached by dereferencing the initial value of ptr in the frame
> above").
>
> Sorry if I've misread things; I'll try to answer the rest of the email
> as best as I can...
>
>> So, first question: is there any way to associate and track the state
>> of
>> a rvalue, independently of its lvalue?
>>
>> To try to clarify the question, here's an example:
>>
>> '''
>> int __attribute__("__some_attribute__") x = 42;
>> /* STEP-1
>>   From now on, we consider the state of x as being marked by
>> some_attribute.
>> But in fact, in the log, we can observe that we'll have something
>> like
>> this in the new '/ana::program_state/':
>> {0x4b955b0: (int)42: marked_state (‘x’)} */
> Yes: sm-state is associated within a program_state with an svalue, in
> this case with the constant 42.
>
> There isn't yet a way to instead give sm-state to "x" itself.  I
> suppose you could give sm-state to &x (the pointer that points to x is
> an instance of region_svalue, which is an svalue), but I haven't tried
> this approach myself.
>
>> int *y = &x;
>> /* STEP-2
>> For analysis purpose, you want to know that from now on, y is
>> pointing
>> to marked data.
>> So you set state of the LHS of the GIMPLE statement (i.e. some
>> ssa_name
>> instance of y) accordingly, with a state named 'points-
>> to_marked_data'
>> and setting 'x' as the origin of the state (in the sense of the
>> argument
>> /origin/ from '/ana::sm_context::on_transition/'.
>> What we now have in the new '/ana::program_state/' is this:
>> {0x4b9acb0: &x: points-to-marked_data (‘&x’) (origin: 0x4b955b0:
>> (int)42
>> (‘x’)), 0x4b955b0: (int)42: marked_state (‘x’)} */
> Yes: you've set the state on the svalue "&x", not on "y".
>
>> int z = *y;
>> /* STEP-3
>> Now you have implicit copy of marked data, and you want to report it.
>> So you state the LHS of the GIMPLE statement (i.e. some ssa_name
>> instance of z) as being marked, with 'y' as the origin.
>> What we now have in the new '/ana::program_state/' is this:
>> {0x4b9acb0: &x: points-to-marked_data (‘&x’) (origin: 0x4b955b0:
>> (int)42
>> (‘x’)), 0x4b955b0: (int)42: marked_state (‘x’)} */
>> '''
> Presumably the program_state also shows that you have a binding for the
> region "z", containing the svalue 42 (this is represented within the
> "store", within the region_model within the program_state).

Indeed, there is a binding for region "z" to the svalue 42 within the 
new program state.

>> In STEP-2:
>>
>> We lost the information saying that the rvalue of y (i.e. y), is
>> pointing to marked data.
>> Only remain the information that its lvalue (i.e. &x), is pointing to
>> marked data, which is of course true.
> Note that the analyzer by default attempts to purge state that it
> doesn't think will be needed anymore, so it may have purged the state
> of "y" if y isn't going to be read from anymore.  You could try -fno-
> analyzer-state-purge to avoid this purging.

Nothing changes when I run it with the -fno-anlyzer-state-purge,
it is still the state of &x which is tracked.

>> In STEP-3:
>>
>> No information is added regarding z in the new program_state.
>> In fact, if one have a closer look at the log, we see that the LHS of
>> the GIMPLE statement (the ssa_name instance of z), is already in the
>> state of 'marked_data'.
>> Through the log to the call to '/sm_context::on_transition/' this can
>> be
>> seen:
>>       marked_sm: state transition of ‘z_5’: marked_data -> marked_data
>>
>> All of this step is somehow leading to confusing diagnostic.
>> For example, you miss the fact that 'z' is being marked, because no
>> event happen as it is somehow aliasing 'x' svalue.
>> Though it might not be true in case of missed optimization.
>>
>> Of course, if anyone wants to have a look at my code, I can provide
>> it
>> to you (it is not yet publicly available at the moment).
> I think I'd have to look at the code to be of more help; I confess that
> I stopped understanding somewhere around step 3, sorry.
No worries, the idea is that regarding the state_map, z_3 is already 
considered as marked.
> Are you able to post a simple example of the code you'd like to
> analyze, with an idea of the behavior you'd like from the analyzer?
> (e.g. the state transitions)

Sure here is an exemple:

'''
int __attribute__("__some_attribute__") x = 42;
// Transition state for x: start -> marked
int *y = &x;
// Transition state for y: start -> points-to_marked
int z = *y;
// Transition state for z: start -> marked
if (z);
// Diagnostic is issued because of the use of a marked data in a 
condition, for analysis purpose.
'''

The issued diagnostic I do have in mind looks like:

'''
1 | int __attribute__("__some_attribute__") x = 42;
    | ^
    | |
    | (1) ‘x’ gets marked here
2 | int *y = &x;
    |       ~
    |        |
    |      (2) ‘y’ points-to marked data here
3 | int z = *y;
    |     ~
    |      |
    |    (3) ‘z’ gets marked here
4 | if (z);
    |     ~
    |      |
    |    (4) used of a marked value 'z' in a condition
'''

What I get from now on is this:

'''
1 |     int __attribute__((__taint__)) x = 42;
    |                                                 ^
    |                                                  |
    |                                                (1) ‘x’ gets 
tainted here
    |......
4 |     if(z) return 666;
    |        ~
    |         |
    |       (2) use of tainted value ‘z’ in a condition
'''

And for some reason, if I change the analyzed code to this:

'''
int __attribute__("__some_attribute__") x = 42;
int *y = &x;
x = 6; // New line
int z = *y;
if (z);
'''

This is what I get:

'''
2 |     int __attribute__((__taint__)) x = 42;
    |                                                 ^
    |                                                  |
    |                                                (1) ‘x’ gets marked 
here
3 |     int * y = &x;
    |            ~
    |             |
    |           (2) ‘&x’ points to tainted data here
4 |     x = 6;
5 |     int z = *y;
    |          ~
    |           |
    |         (3) ‘x’ gets tainted here
6 |     if(z) return 666;
    |        ~
    |         |
    |       (4) use of tainted value ‘z’ in a condition
'''

Sorry for the long mail here,
Let me know if you want to have a look at the exploded graph or at the 
analyzer's log.

Thank you for your time,

Pierrick


  reply	other threads:[~2023-03-16  8:54 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-03-15 15:24 Pierrick Philippe
2023-03-15 16:26 ` David Malcolm
2023-03-16  8:54   ` Pierrick Philippe [this message]
2023-03-16 13:44     ` David Malcolm
2023-03-21 14:36       ` Pierrick Philippe

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ff85e786-4401-5735-9f4d-8b1f7bd3601a@irisa.fr \
    --to=pierrick.philippe@irisa.fr \
    --cc=dmalcolm@redhat.com \
    --cc=gcc@gcc.gnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).