Re: [Static Analyzer] Loop handling - False positive for malloc-sm

From: David Malcolm <dmalcolm@redhat.com>
To: Pierrick Philippe <pierrick.philippe@irisa.fr>, gcc@gcc.gnu.org
Subject: Re: [Static Analyzer] Loop handling - False positive for malloc-sm
Date: Wed, 22 Mar 2023 14:19:41 -0400
Message-ID: <b285f0a74eb8417e0a541002fb207bbcd1b49a74.camel@redhat.com>
In-Reply-To: <805abf28-3991-df57-51b5-d1e1f4f398b6@irisa.fr>

On Tue, 2023-03-21 at 09:21 +0100, Pierrick Philippe wrote:
> On 21/03/2023 00:30, David Malcolm wrote:
> > On Mon, 2023-03-20 at 13:28 +0100, Pierrick Philippe wrote:
> > > Hi everyone,
> > > 
> > > I'm still playing around with the analyzer, and wanted to have a
> > > look at loop handling.
> > > I'm using a build from /trunk/ branch (/20230309/).
> > > 
> > > Here is my analyzed code:
> > > 
> > > '''
> > > #include <stdlib.h>
> > > int main(void) {
> > >     void * ptr = malloc(sizeof(int));
> > >     for (int i = 0; i < 10; i++) {
> > >         if (i == 5) free(ptr);
> > >     }
> > > }
> > > '''
> [stripping]
> > > So, I'm guessing that this false positive is due to how the
> > > analyzer is handling loops.
> > > Which leads to my question: how are loops handled by the analyzer?
> > Sadly, the answer is currently "not very well" :/
> > 
> > I implemented my own approach, with a "widening_svalue" subclass of
> > symbolic value.  This is widening in the Abstract Interpretation
> > sense (as opposed to the bitwise operations sense): if I see
> > multiple values on successive iterations, the widening_svalue tries
> > to simulate that we know the start value and the direction the
> > variable is moving in.
> > 
> > This doesn't work well; arguably I should rewrite it, perhaps with
> > an iterator_svalue, though I'm not sure how it ought to work.  Some
> > ideas:
> > 
> > * reuse gcc's existing SSA-based loop analysis, which I believe can
> > identify SSA names that are iterator variables, figure out their
> > bounds, and their per-iteration increments, etc.
> > 
> > * rework the program_point or supergraph code to have a notion of
> > "1st iteration of loop", "2nd iteration of loop", "subsequent
> > iterations", or similar, so that the analyzer can explore those
> > cases differently (on the assumption that such iterations hopefully
> > catch the most interesting bugs)


I've filed an RFE discussing some of the problems with -fanalyzer's
loop-handling here:

  https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109252

including the idea of making use of GCC's existing SSA-based loop
analysis (which discovers a tree of loops within each function's CFG).

> 
> I see, I don't know if you ever considered allowing state machines to
> deal with loops on their own.
> Such as having an API for registering a callback to handle loops,
> but not in a mandatory way.
> Or having a set of APIs to optionally implement for the analyzer to
> call.

I hadn't thought of that, but it sounds like a reasonable idea.

> 
> It would allow state machines to analyze loops according to the
> semantics of their own inner analysis.
> 
> This could allow them to try to find a fixed point in the loop
> execution which doesn't have any impact on the program state for that
> state machine. Kind of like a custom loop invariant.
> Because depending on the analysis goal of the state machine, you might
> need to symbolically execute the loop only a few times before
> reentering the loop and having the entry state be the same as the
> end-of-loop state.

The analyzer performs symbolic execution; via various heuristics, it
tries to achieve a reasonable balance between:
* precision of state tracking,
* achieving decent coverage of code and data flow, and
* ensuring termination.

Its current loop implementation uses widening_svalue and the complexity
limits on svalues/regions to attempt to have the symbolic execution
terminate due to hitting already-visited nodes in the exploded_graph,
or else hit per-program-point limits.  Unfortunately this often doesn't
work well.
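
To illustrate what "widening" means in the Abstract Interpretation
sense, here is a minimal sketch using the textbook interval domain.  It
is not the analyzer's widening_svalue implementation: struct interval,
widen, interval_eq and analyze_counting_loop are all hypothetical names
made up for this example, and the loop exit condition is deliberately
ignored.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical interval abstraction of an int variable; not a GCC type.  */
struct interval { int lo, hi; };

/* Textbook interval widening: any bound that moved since the previous
   visit is pushed to its extreme, so a chain of widenings is finite.  */
static struct interval
widen (struct interval prev, struct interval next)
{
  struct interval r;
  r.lo = next.lo < prev.lo ? INT_MIN : prev.lo;
  r.hi = next.hi > prev.hi ? INT_MAX : prev.hi;
  return r;
}

static bool
interval_eq (struct interval a, struct interval b)
{
  return a.lo == b.lo && a.hi == b.hi;
}

/* Abstractly run "for (i = start; ...; i++)", ignoring the exit
   condition, widening at the loop head until the state stops changing.
   Returns the number of loop-head states computed before stabilising.  */
static int
analyze_counting_loop (int start, struct interval *out)
{
  struct interval head = { start, start };
  int visits = 1;
  for (;;)
    {
      /* Effect of "i++" on the interval, saturating at INT_MAX.  */
      struct interval body = head;
      if (body.lo < INT_MAX) body.lo++;
      if (body.hi < INT_MAX) body.hi++;

      /* Join the current head state with the back-edge state, then
         widen against the previous head state.  */
      struct interval joined = { head.lo < body.lo ? head.lo : body.lo,
                                 head.hi > body.hi ? head.hi : body.hi };
      struct interval widened = widen (head, joined);
      if (interval_eq (widened, head))
        break;
      head = widened;
      visits++;
    }
  *out = head;
  return visits;
}
```

In this sketch a counter starting at 0 stabilises at [0, INT_MAX] after
two loop-head states: the start value and the direction of travel are
kept, which guarantees termination, but the result no longer
distinguishes individual iterations such as i == 5.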

GCC's optimization code has both GIMPLE and RTL loop analysis code. 
The RTL code runs too late for the analyzer, but the GIMPLE loop
analysis code is in cfgloop.{h,cc} and thus we would have access to
information about loops, at least for well-behaved cases - though
possibly only when optimization is enabled.

> 
> In fact, this could be done directly by the analyzer, calling the
> state machine APIs for loop handling only for state machines which
> have not yet reached such a fixed point in their program state for
> the analyzed loop, with a maximum number of executions fixed by the
> analyzer to limit execution time.
> 
> Does what I'm saying make sense?

I think so, though I'm not sure how it would work in practice. 
Consider e.g. 

  for (int i = 0; i < n; i++)
     head = prepend_node (head, i);

which builds a chain of n dynamically-allocated nodes in a linked list.
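
To make that loop concrete, here is a minimal self-contained sketch.
Only the name prepend_node comes from the snippet above; struct node,
its fields, build_list and free_list are assumptions added for
illustration.

```c
#include <stdlib.h>

/* Hypothetical node type; the snippet above only names prepend_node.  */
struct node
{
  int value;
  struct node *next;
};

/* Allocate a node holding VALUE and put it in front of HEAD.
   On allocation failure the existing list is returned unchanged.  */
static struct node *
prepend_node (struct node *head, int value)
{
  struct node *n = malloc (sizeof *n);
  if (!n)
    return head;
  n->value = value;
  n->next = head;
  return n;
}

/* The loop from the snippet above: builds a chain of up to N nodes.  */
static struct node *
build_list (int n)
{
  struct node *head = NULL;
  for (int i = 0; i < n; i++)
    head = prepend_node (head, i);
  return head;
}

/* Release every node in the chain starting at HEAD.  */
static void
free_list (struct node *head)
{
  while (head)
    {
      struct node *next = head->next;
      free (head);
      head = next;
    }
}
```

Each successful iteration adds one more heap allocation reachable from
head, so in this sketch the state at the end of an iteration is never
identical to the state at its start; a fixed point would presumably
need some way of summarising the chain.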

> 
> In terms of implementation, loop detection can be done by looking for
> strongly connected components (SCCs) having more than one node in a
> function graph.
> I don't know if this is how it is already done within the analyzer or
> not?

It isn't yet done in the analyzer, but as noted above there is code in
GCC that already does that (in cfgloop.{h,cc}).
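
For reference, the SCC idea from the quoted question can be sketched
with Tarjan's algorithm over a small adjacency-matrix graph.  This is a
standalone illustration and makes no claim about how cfgloop.{h,cc}
identifies loops: struct cfg, struct scc_state, find_sccs and
node_in_loop are hypothetical names, MAX_NODES is an arbitrary limit,
and treating a self-edge as a loop is an addition to the "more than one
node" rule.

```c
#include <stdbool.h>

#define MAX_NODES 16

/* Hypothetical adjacency-matrix CFG; not a GCC data structure.  */
struct cfg
{
  int num_nodes;
  bool edge[MAX_NODES][MAX_NODES];
};

struct scc_state
{
  const struct cfg *g;
  int index[MAX_NODES];    /* DFS discovery order, -1 if unvisited.  */
  int lowlink[MAX_NODES];  /* Smallest index reachable via the stack.  */
  bool on_stack[MAX_NODES];
  int stack[MAX_NODES];
  int sp;
  int next_index;
  int scc_id[MAX_NODES];   /* Output: component number per node.  */
  int scc_size[MAX_NODES]; /* Output: node count per component.  */
  int num_sccs;
};

/* Recursive step of Tarjan's algorithm, starting at node V.  */
static void
strongconnect (struct scc_state *s, int v)
{
  s->index[v] = s->lowlink[v] = s->next_index++;
  s->stack[s->sp++] = v;
  s->on_stack[v] = true;

  for (int w = 0; w < s->g->num_nodes; w++)
    {
      if (!s->g->edge[v][w])
        continue;
      if (s->index[w] < 0)
        {
          strongconnect (s, w);
          if (s->lowlink[w] < s->lowlink[v])
            s->lowlink[v] = s->lowlink[w];
        }
      else if (s->on_stack[w] && s->index[w] < s->lowlink[v])
        s->lowlink[v] = s->index[w];
    }

  /* V is the root of a component: pop its members off the stack.  */
  if (s->lowlink[v] == s->index[v])
    {
      int id = s->num_sccs++;
      int w;
      s->scc_size[id] = 0;
      do
        {
          w = s->stack[--s->sp];
          s->on_stack[w] = false;
          s->scc_id[w] = id;
          s->scc_size[id]++;
        }
      while (w != v);
    }
}

/* Run Tarjan's algorithm over every node of G, filling in S.  */
static void
find_sccs (const struct cfg *g, struct scc_state *s)
{
  s->g = g;
  s->sp = s->next_index = s->num_sccs = 0;
  for (int v = 0; v < g->num_nodes; v++)
    {
      s->index[v] = -1;
      s->on_stack[v] = false;
    }
  for (int v = 0; v < g->num_nodes; v++)
    if (s->index[v] < 0)
      strongconnect (s, v);
}

/* A node is in a loop if its SCC has more than one node
   (a self-edge is treated as a loop too).  */
static bool
node_in_loop (const struct cfg *g, const struct scc_state *s, int v)
{
  return s->scc_size[s->scc_id[v]] > 1 || g->edge[v][v];
}
```

For a graph shaped like a simple for loop (entry -> header, header ->
body, body -> header, header -> exit), the header and body end up in
one two-node SCC, while the entry and exit nodes are each in a
singleton SCC.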

Dave

Thread overview: 8+ messages
2023-03-20 12:28 Pierrick Philippe
2023-03-20 23:30 ` David Malcolm
2023-03-21  8:21   ` Pierrick Philippe
2023-03-22 18:19     ` David Malcolm [this message]
2023-03-23  8:06       ` Pierrick Philippe
2023-03-21 10:01   ` Shengyu Huang
2023-03-22 18:34     ` David Malcolm
2023-03-21 10:12   ` Shengyu Huang