public inbox for gcc@gcc.gnu.org
 help / color / mirror / Atom feed
From: David Malcolm <dmalcolm@redhat.com>
To: Shengyu Huang <kumom.huang@gmail.com>
Cc: GCC Development <gcc@gcc.gnu.org>
Subject: Re: [GSoC][Static Analyzer] Ideas for proposal
Date: Mon, 13 Mar 2023 11:51:47 -0400	[thread overview]
Message-ID: <3dfad33dec50c9f8bfb13e42a29cfb41b6aab457.camel@redhat.com> (raw)
In-Reply-To: <4CBE37A2-7D50-4ECC-9B70-951AB7176D9B@gmail.com>

On Sun, 2023-03-12 at 23:20 +0100, Shengyu Huang wrote:
> Hi Dave,
> 
> > > 
> > > 4. What’s the most interesting to me are PR103533
> > > (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103533),
> > 
> > Turning on taint detection by default would be a great project.  It
> > would be good to run the integration tests:
> >  https://github.com/davidmalcolm/gcc-analyzer-integration-tests
> > to see if anything regresses, or if it adds noise - so this might
> > be a
> > bit of an open-ended project, in that we'd want to fix whatever
> > issues
> > show up there, as well as the known ones that are documented in
> > that
> > bug.
> > 
> 
> Sorry for replying to you late due to another project from my
> university. 
> 
> Since most other ideas are being worked on by you or not big enough
> to make a GSoC project, I decided to take on this project and have
> been getting familiar with the analyzer this weekend. 

Excellent; thanks.

> I want to sort several things out before writing the proposal.
> 
> 1. What should I do with the integration tests?

First of all, AFAIK I'm the only person who's tried running the
integration tests.  They're the test scripts I wrote to help me
validate my own patches, so there will be rough edges; please let me
know as you run into them, so I can fix/document them.

I have scripts that run the integration test's test.py, passing in the
path to the built gcc and a "run directory" where the builds happen; I
do this for a "control" build of gcc, and for an "experiment" build
that has a patch (each with their own run directory).  This script
attempts to use that gcc to build the various projects, capturing the
diagnostics as lots of little .sarif files in the build dir.

One of these run directories takes about 17G of drive space, and takes
about an hour for me on a fast machine I have (64 cores).  We'll
probably need to get you set up with an account on the gcc compile
farm, which has lots of powerful machines that you can ssh into, unless
your university has something powerful you can use (with plenty of
cores, RAM, and free disk space, e.g. at least 60G of disk)

I then have a script that runs compare-by-warning.py, passing in the
paths to the two run dirs; this recurses through the two rundirs,
loading the .sarif files, and attempts to compare the before vs after
diagnostics.  I've attempted to classify the results I've seen via the
known-issues/*.txt files, so that the comparison has some knowledge
about whether the changes we've seen are e.g.:
- new false positives vs 
- new true positives vs
- false positives going away vs 
- true positives going away 
(etc)

That said, the "Juliet" results are currently rather unwieldy (many
more results than for the other projects, and 9.1G of the 17G by
space), so I tend to move them out of the way before doing the
comparison.

> 
> 2. I ran gcc -fanalyzer -fanalyzer-checker=taint ./gcc-
> src/gcc/testsuite/gcc.dg/analyzer/pr93032-mztools-signed-char.c , but
> I got different results from what you documented in PR103533:
> 
> /usr/bin/ld: /lib/x86_64-linux-gnu/crt1.o: in function `_start':
> (.text+0x17): undefined reference to `main'
> collect2: error: ld returned 1 exit status

gcc's default is to try to compile, assemble, and link into an
executable.  This testcase doesn't have a "main" function, hence the
linker complains.  If you pass "-S", it will merely compile the .c to a
.s assembler file whilst still running the analyzer.

In terms of actually running the test suite via DejaGnu, see:
https://gcc-newbies-guide.readthedocs.io/en/latest/working-with-the-testsuite.html

I typically use:

  make -k -jN \
  && time make check-gcc \
         RUNTESTFLAGS="-v -v --target_board=unix\{-m32,-m64\} analyzer-torture.exp=*.c analyzer.exp=*.c"

when testing the analyzer regression test suite, where N is the number
of cores on my box

When I run an individual testcase, I do something like:

./xgcc -B. -S -fanalyzer ../../src/PATH_TO_TEST_CASE

in the "gcc" subdirectory of the build directory.


> 
> 3. What does “ICE” mean when you said “ICE in alt_get_inherited_state
> in abs-1.c, …”?

ICE is our jargon for "internal compiler error" i.e. a crash of gcc
itself.

> 
> 4. For the following program, nothing is reported with the taint mode
> turned on. But there is -Wanalyzer-tained-divisor, is it expected?
> 
> __attribute__((tainted_args))
> int fun0(int a)
> { return a; }
> 
> int main()
> {
>   int b = 3 / fun0(0);
>   return b;
> }

Yes: in that the 0 came from the source of the program, rather than
from an attacker, so it's not tainted.  The analyzer doesn't have a
good way to attach state-machine state to a constant, only to other
kinds of symbolic value.

See
  gcc/testsuite/gcc.dg/analyzer/taint-divisor-1.c 
  gcc/testsuite/gcc.dg/plugin/taint-antipatterns-1.c
for examples that ought to report tainted divisors (the former from
"fread", the latter from "copy_from_user" via a plugin)


> 
> 5. I guess the project would mostly modify constraint-manager.h and
> sm-taint.cc <http://sm-taint.cc/>. Or are there other files that you
> suspect relevant for this project?

I think region-model.{cc,h} is likely to be very relevant here, and
possibly program-state.{cc,h}; I think one of the challenges will be to
see to what extent when we enable the taint state machine by default it
bloats the program states (much of which is handled in class
region_model) to the point where the exploded_graph gets much bigger,
and we lose coverage compared to what we had before.  I think we're
going to need to improve state purging so that e.g. if there's a buffer
containing tainted data that only gets used in one part of the function
that we can stop bothering to track its taintedness after it becomes
relevant.

I suspect the project may be rather open-ended, in that it's a case of
turning the feature on, trying it on real-world C projects (as well as
just the regression testsuite), and seeing:
- to what extent it's useful, and 
- to what extent it's spamming the user, and
- what breaks

and fixing the issues you encounter up to the point where it's
reasonable to enable the feature for GCC 14 (hopefully).

> 
> 6. Is the current implementation based on some papers?

I confess I haven't read much in this space; I'm looking forward to
reading the papers you linked to


>  I found this
> (https://users.ece.cmu.edu/~aavgerin/papers/Oakland10.pdf) and this
> (https://www.ndss-symposium.org/wp-content/uploads/2017/09/Dynamic-Ta
> int-Analysis-for-Automatic-Detection-Analysis-and-
> SignatureGeneration-of-Exploits-on-Commodity-Software-Dawn-Song.pdf),
> but haven’t started reading yet. In addition, purging states of the
> constraint manager sounds like a problem other people may have looked
> at. Is there any related progress since you documented in PR103533?
> 
> As you said, this would be an open-ended project, so it would be very
> helpful to get some feedback from you so that I know how to draft my
> proposal. 

(nods)

> In addition, is it ok to deviate from the proposal after I start
> working? 

Yes: as noted above, much of the project would be to try turning it on
for real-world C code, seeing what breaks, and fixing that, so we can't
yet know what that will be.  Depending on how hard the issues are
"success" for the project could be "fixed all issues and enabled it in
trunk for GCC 14" vs "identified and wrote up a set of issues that need
resolving", or somewhere in between.

Hope this makes sense (and isn't too intimidating!)
Dave


  reply	other threads:[~2023-03-13 15:51 UTC|newest]

Thread overview: 30+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-02-21 21:26 [GSoC][Static Analyzer] Some questions and request for a small patch to work on Shengyu Huang
2023-02-21 22:55 ` David Malcolm
2023-02-22 14:03   ` Jonathan Wakely
2023-02-23 11:17     ` James K. Lowden
2023-02-23 16:30       ` Jonathan Wakely
2023-02-22 14:11   ` Shengyu Huang
2023-02-22 14:53     ` Iain Sandoe
2023-02-22 15:13       ` Iain Sandoe
2023-02-27 13:35       ` Shengyu Huang
2023-02-27 13:45         ` Iain Sandoe
2023-02-27 13:51           ` Shengyu Huang
2023-02-27 13:49         ` Iain Sandoe
2023-02-27 14:01           ` Floyd, Paul
2023-02-22 15:43     ` David Malcolm
2023-02-28  9:18       ` Shengyu Huang
2023-02-28 23:59         ` David Malcolm
2023-03-01 11:16           ` Shengyu Huang
2023-03-01 13:48             ` David Malcolm
2023-02-28 14:46     ` [GSoC][Static Analyzer] Ideas for proposal Shengyu Huang
2023-03-01  0:22       ` David Malcolm
2023-03-12 22:20         ` Shengyu Huang
2023-03-13 15:51           ` David Malcolm [this message]
2023-03-20 17:28             ` [GSoC][Static Analyzer] First proposal draft and a few more questions/requests Shengyu Huang
2023-03-20 22:50               ` David Malcolm
2023-03-26 16:03                 ` Shengyu Huang
2023-03-26 17:14                   ` David Malcolm
2023-03-26 21:46                     ` Shengyu Huang
2023-04-01 14:19                       ` Shengyu Huang
2023-04-02 22:53                         ` David Malcolm
2023-04-03  0:02                           ` Shengyu Huang

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=3dfad33dec50c9f8bfb13e42a29cfb41b6aab457.camel@redhat.com \
    --to=dmalcolm@redhat.com \
    --cc=gcc@gcc.gnu.org \
    --cc=kumom.huang@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).