Message-ID: <3dfad33dec50c9f8bfb13e42a29cfb41b6aab457.camel@redhat.com>
Subject: Re: [GSoC][Static Analyzer] Ideas for proposal
From: David Malcolm <dmalcolm@redhat.com>
To: Shengyu Huang <kumom.huang@gmail.com>
Cc: GCC Development <gcc@gcc.gnu.org>
Date: Mon, 13 Mar 2023 11:51:47 -0400
In-Reply-To: <4CBE37A2-7D50-4ECC-9B70-951AB7176D9B@gmail.com>
References: <960EE623-1B17-4321-B77E-FBCD9496BE1F@gmail.com>
	 <40fbb064f56845908f797400e5d9443b6cf97fe4.camel@redhat.com>
	 <E8D8996A-A2B9-4011-8093-96657AC89A80@gmail.com>
	 <F68036F8-37B0-4519-8906-B6A54C05F5BD@gmail.com>
	 <0e6a972dac60ad290d21a82b428cc76c4e8565e9.camel@redhat.com>
	 <4CBE37A2-7D50-4ECC-9B70-951AB7176D9B@gmail.com>

On Sun, 2023-03-12 at 23:20 +0100, Shengyu Huang wrote:
> Hi Dave,
>
> > >
> > > 4. What’s the most interesting to me are PR103533
> > > (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103533),
> >
> > Turning on taint detection by default would be a great project.  It
> > would be good to run the integration tests:
> >  https://github.com/davidmalcolm/gcc-analyzer-integration-tests
> > to see if anything regresses, or if it adds noise - so this might
> > be a
> > bit of an open-ended project, in that we'd want to fix whatever
> > issues
> > show up there, as well as the known ones that are documented in
> > that
> > bug.
> >
>
> Sorry for replying to you late due to another project from my
> university.
>
> Since most other ideas are being worked on by you or not big enough
> to make a GSoC project, I decided to take on this project and have
> been getting familiar with the analyzer this weekend.

Excellent; thanks.

> I want to sort several things out before writing the proposal.
>
> 1. What should I do with the integration tests?

First of all, AFAIK I'm the only person who's tried running the
integration tests.  They're the test scripts I wrote to help me
validate my own patches, so there will be rough edges; please let me
know as you run into them, so I can fix/document them.

I have scripts that run the integration tests' test.py, passing in the
path to the built gcc and a "run directory" where the builds happen; I
do this for a "control" build of gcc, and for an "experiment" build
that has a patch (each with its own run directory).  test.py
attempts to use that gcc to build the various projects, capturing the
diagnostics as lots of little .sarif files in the build dir.

One of these run directories takes about 17G of drive space, and takes
about an hour for me on a fast machine I have (64 cores).  We'll
probably need to get you set up with an account on the gcc compile
farm, which has lots of powerful machines that you can ssh into, unless
your university has something powerful you can use (with plenty of
cores, RAM, and free disk space, e.g. at least 60G of disk).

I then have a script that runs compare-by-warning.py, passing in the
paths to the two run dirs; this recurses through the two rundirs,
loading the .sarif files, and attempts to compare the before vs after
diagnostics.  I've attempted to classify the results I've seen via the
known-issues/*.txt files, so that the comparison has some knowledge
about whether the changes we've seen are e.g.:
- new false positives vs
- new true positives vs
- false positives going away vs
- true positives going away
(etc)

That said, the "Juliet" results are currently rather unwieldy (many
more results than for the other projects, and 9.1G of the 17G by
space), so I tend to move them out of the way before doing the
comparison.
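
To make the before/after comparison step concrete, here is a minimal
sketch of the idea in Python.  It is not the actual compare-by-warning.py:
the function names (load_warnings, compare_runs) are made up for
illustration, it keys each diagnostic on (ruleId, file, line) from the
standard SARIF 2.1.0 layout, and it does no known-issues classification.

```python
import json
from collections import Counter
from pathlib import Path


def _key(result):
    """Identify a SARIF result by (ruleId, file URI, start line)."""
    locations = result.get("locations") or [{}]
    physical = locations[0].get("physicalLocation", {})
    uri = physical.get("artifactLocation", {}).get("uri")
    line = physical.get("region", {}).get("startLine")
    return (result.get("ruleId"), uri, line)


def load_warnings(run_dir):
    """Count diagnostics across every .sarif file below run_dir (hypothetical helper)."""
    counts = Counter()
    for sarif_path in sorted(Path(run_dir).rglob("*.sarif")):
        with open(sarif_path, encoding="utf-8") as f:
            log = json.load(f)
        for run in log.get("runs", []):
            for result in run.get("results", []):
                counts[_key(result)] += 1
    return counts


def compare_runs(control_dir, experiment_dir):
    """Return (added, removed) diagnostics between the two run directories."""
    before = load_warnings(control_dir)
    after = load_warnings(experiment_dir)
    # Counter subtraction keeps only the positive differences.
    return after - before, before - after
```

Each entry in "added" is then a candidate new true or false positive, and
each entry in "removed" a diagnostic that went away; deciding which is
which still needs a human (or the known-issues files).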

>
> 2. I ran gcc -fanalyzer -fanalyzer-checker=taint
> ./gcc-src/gcc/testsuite/gcc.dg/analyzer/pr93032-mztools-signed-char.c,
> but I got different results from what you documented in PR103533:
>=20
> /usr/bin/ld: /lib/x86_64-linux-gnu/crt1.o: in function `_start':
> (.text+0x17): undefined reference to `main'
> collect2: error: ld returned 1 exit status

gcc's default is to try to compile, assemble, and link into an
executable.  This testcase doesn't have a "main" function, hence the
linker complains.  If you pass "-S", it will merely compile the .c to a
.s assembler file whilst still running the analyzer.

In terms of actually running the test suite via DejaGnu, see:
https://gcc-newbies-guide.readthedocs.io/en/latest/working-with-the-testsuite.html

I typically use:

  make -k -jN \
  && time make check-gcc \
         RUNTESTFLAGS="-v -v --target_board=unix\{-m32,-m64\} analyzer-torture.exp=*.c analyzer.exp=*.c"

when testing the analyzer regression test suite, where N is the number
of cores on my box.

When I run an individual testcase, I do something like:

./xgcc -B. -S -fanalyzer ../../src/PATH_TO_TEST_CASE

in the "gcc" subdirectory of the build directory.
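
If you end up running many individual testcases, the invocation above can
be wrapped in a small helper.  This is only a sketch: the names
analyzer_command and run_testcase are hypothetical, and it assumes it is
pointed at the "gcc" subdirectory of a build tree.  The flags are the ones
already shown in this thread.

```python
import subprocess


def analyzer_command(testcase, checker=None):
    """Build the argv for analyzing one testcase without assembling or linking."""
    argv = ["./xgcc", "-B.", "-S", "-fanalyzer"]
    if checker is not None:
        # e.g. checker="taint" gives -fanalyzer-checker=taint
        argv.append(f"-fanalyzer-checker={checker}")
    argv.append(testcase)
    return argv


def run_testcase(build_gcc_dir, testcase, checker=None):
    """Run the command from build_gcc_dir; return (exit code, captured stderr)."""
    proc = subprocess.run(
        analyzer_command(testcase, checker),
        cwd=build_gcc_dir,  # "./xgcc" and "-B." are resolved relative to this directory
        capture_output=True,
        text=True,
    )
    return proc.returncode, proc.stderr
```

The diagnostics arrive on stderr; since "-S" is used, a .s file is also
written into the directory the command runs in.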


>
> 3. What does “ICE” mean when you said “ICE in alt_get_inherited_state
> in abs-1.c, …”?

ICE is our jargon for "internal compiler error" i.e. a crash of gcc
itself.

>
> 4. For the following program, nothing is reported with the taint mode
> turned on. But there is -Wanalyzer-tainted-divisor, is it expected?
>=20
> __attribute__((tainted_args))
> int fun0(int a)
> { return a; }
>=20
> int main()
> {
>   int b = 3 / fun0(0);
>   return b;
> }

Yes: in that the 0 came from the source of the program, rather than
from an attacker, so it's not tainted.  The analyzer doesn't have a
good way to attach state-machine state to a constant, only to other
kinds of symbolic value.

See
  gcc/testsuite/gcc.dg/analyzer/taint-divisor-1.c
  gcc/testsuite/gcc.dg/plugin/taint-antipatterns-1.c
for examples that ought to report tainted divisors (the former from
"fread", the latter from "copy_from_user" via a plugin)


>
> 5. I guess the project would mostly modify constraint-manager.h and
> sm-taint.cc. Or are there other files that you
> suspect relevant for this project?

I think region-model.{cc,h} is likely to be very relevant here, and
possibly program-state.{cc,h}.  I think one of the challenges will be to
see to what extent enabling the taint state machine by default
bloats the program states (much of which is handled in class
region_model) to the point where the exploded_graph gets much bigger
and we lose coverage compared to what we had before.  I think we're
going to need to improve state purging so that e.g. if there's a buffer
containing tainted data that only gets used in one part of the function,
we can stop bothering to track its taintedness once it is no longer
relevant.

I suspect the project may be rather open-ended, in that it's a case of
turning the feature on, trying it on real-world C projects (as well as
just the regression testsuite), and seeing:
- to what extent it's useful, and
- to what extent it's spamming the user, and
- what breaks

and fixing the issues you encounter up to the point where it's
reasonable to enable the feature for GCC 14 (hopefully).

>
> 6. Is the current implementation based on some papers?

I confess I haven't read much in this space; I'm looking forward to
reading the papers you linked to.


>  I found this
> (https://users.ece.cmu.edu/~aavgerin/papers/Oakland10.pdf) and this
> (https://www.ndss-symposium.org/wp-content/uploads/2017/09/Dynamic-Taint-Analysis-for-Automatic-Detection-Analysis-and-SignatureGeneration-of-Exploits-on-Commodity-Software-Dawn-Song.pdf),
> but haven’t started reading yet. In addition, purging states of the
> constraint manager sounds like a problem other people may have looked
> at. Is there any related progress since you documented in PR103533?
>
> As you said, this would be an open-ended project, so it would be very
> helpful to get some feedback from you so that I know how to draft my
> proposal.

(nods)

> In addition, is it ok to deviate from the proposal after I start
> working?

Yes: as noted above, much of the project would be to try turning it on
for real-world C code, seeing what breaks, and fixing that, so we can't
yet know what that will be.  Depending on how hard the issues are,
"success" for the project could be "fixed all issues and enabled it in
trunk for GCC 14" vs "identified and wrote up a set of issues that need
resolving", or somewhere in between.

Hope this makes sense (and isn't too intimidating!)
Dave