Message-ID: <3dfad33dec50c9f8bfb13e42a29cfb41b6aab457.camel@redhat.com>
Subject: Re: [GSoC][Static Analyzer] Ideas for proposal
From: David Malcolm
To: Shengyu Huang
Cc: GCC Development
Date: Mon, 13 Mar 2023 11:51:47 -0400
In-Reply-To: <4CBE37A2-7D50-4ECC-9B70-951AB7176D9B@gmail.com>
References: <960EE623-1B17-4321-B77E-FBCD9496BE1F@gmail.com> <40fbb064f56845908f797400e5d9443b6cf97fe4.camel@redhat.com> <0e6a972dac60ad290d21a82b428cc76c4e8565e9.camel@redhat.com> <4CBE37A2-7D50-4ECC-9B70-951AB7176D9B@gmail.com>

On Sun, 2023-03-12 at 23:20 +0100, Shengyu Huang wrote:
> Hi Dave,
>
> > >
> > > 4. What’s the most interesting to me are PR103533
> > > (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103533),
> >
> > Turning on taint detection by default would be a great project. It
> > would be good to run the integration tests:
> >  https://github.com/davidmalcolm/gcc-analyzer-integration-tests
> > to see if anything regresses, or if it adds noise - so this might
> > be a bit of an open-ended project, in that we'd want to fix whatever
> > issues show up there, as well as the known ones that are documented
> > in that bug.
> >
>
> Sorry for replying to you late due to another project from my
> university.
>
> Since most other ideas are being worked on by you or not big enough
> to make a GSoC project, I decided to take on this project and have
> been getting familiar with the analyzer this weekend.

Excellent; thanks.

> I want to sort several things out before writing the proposal.
>
> 1. What should I do with the integration tests?

First of all, AFAIK I'm the only person who's tried running the
integration tests. They're the test scripts I wrote to help me
validate my own patches, so there will be rough edges; please let me
know as you run into them, so I can fix/document them.

I have scripts that run the integration test's test.py, passing in the
path to the built gcc and a "run directory" where the builds happen; I
do this for a "control" build of gcc, and for an "experiment" build
that has a patch (each with their own run directory).
This script attempts to use that gcc to build the various projects,
capturing the diagnostics as lots of little .sarif files in the build
dir.

One of these run directories takes about 17G of drive space, and takes
about an hour for me on a fast machine I have (64 cores). We'll
probably need to get you set up with an account on the gcc compile
farm, which has lots of powerful machines that you can ssh into, unless
your university has something powerful you can use (with plenty of
cores, RAM, and free disk space, e.g. at least 60G of disk).

I then have a script that runs compare-by-warning.py, passing in the
paths to the two run dirs; this recurses through the two rundirs,
loading the .sarif files, and attempts to compare the before vs after
diagnostics. I've attempted to classify the results I've seen via the
known-issues/*.txt files, so that the comparison has some knowledge
about whether the changes we've seen are e.g.:
- new false positives vs
- new true positives vs
- false positives going away vs
- true positives going away
(etc)

That said, the "Juliet" results are currently rather unwieldy (many
more results than for the other projects, and 9.1G of the 17G by
space), so I tend to move them out of the way before doing the
comparison.

>
> 2. I ran gcc -fanalyzer -fanalyzer-checker=taint
> ./gcc-src/gcc/testsuite/gcc.dg/analyzer/pr93032-mztools-signed-char.c ,
> but I got different results from what you documented in PR103533:
>
> /usr/bin/ld: /lib/x86_64-linux-gnu/crt1.o: in function `_start':
> (.text+0x17): undefined reference to `main'
> collect2: error: ld returned 1 exit status

gcc's default is to try to compile, assemble, and link into an
executable. This testcase doesn't have a "main" function, hence the
linker complains. If you pass "-S", it will merely compile the .c to a
.s assembler file whilst still running the analyzer.
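
As a rough illustration of the before/after comparison of .sarif files
described earlier in this message, here is a minimal Python sketch. It
is not the actual compare-by-warning.py from the integration-tests
repository: the function names count_warnings and compare_runs are
hypothetical, and it only counts results per SARIF ruleId, with no
knowledge of the known-issues/*.txt classifications.

```python
import json
from collections import Counter
from pathlib import Path


def count_warnings(run_dir):
    """Count diagnostics per SARIF ruleId in every .sarif file under run_dir.

    Hypothetical helper for illustration; not part of the real scripts.
    """
    counts = Counter()
    for sarif_path in Path(run_dir).rglob("*.sarif"):
        with open(sarif_path, encoding="utf-8") as f:
            log = json.load(f)
        # A SARIF log holds a list of runs, each with a list of results.
        for run in log.get("runs") or []:
            for result in run.get("results") or []:
                counts[result.get("ruleId", "<no ruleId>")] += 1
    return counts


def compare_runs(control_dir, experiment_dir):
    """Map each ruleId whose count changed to (control_count, experiment_count)."""
    before = count_warnings(control_dir)
    after = count_warnings(experiment_dir)
    return {
        rule: (before[rule], after[rule])
        for rule in sorted(set(before) | set(after))
        if before[rule] != after[rule]
    }
```

Calling compare_runs("control-run", "experiment-run") (hypothetical
directory names) returns only the rules whose totals differ between the
two run directories. Deciding whether a change is a new false positive
or a lost true positive would still need per-result classification.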

In terms of actually running the test suite via DejaGnu, see:
https://gcc-newbies-guide.readthedocs.io/en/latest/working-with-the-testsuite.html

I typically use:

  make -k -jN \
    && time make check-gcc \
       RUNTESTFLAGS="-v -v --target_board=unix\{-m32,-m64\} analyzer-torture.exp=*.c analyzer.exp=*.c"

when testing the analyzer regression test suite, where N is the number
of cores on my box.

When I run an individual testcase, I do something like:

  ./xgcc -B. -S -fanalyzer ../../src/PATH_TO_TEST_CASE

in the "gcc" subdirectory of the build directory.

>
> 3. What does “ICE” mean when you said “ICE in alt_get_inherited_state
> in abs-1.c, …”?

ICE is our jargon for "internal compiler error", i.e. a crash of gcc
itself.

>
> 4. For the following program, nothing is reported with the taint mode
> turned on. But there is -Wanalyzer-tainted-divisor, is it expected?
>
> __attribute__((tainted_args))
> int fun0(int a)
> { return a; }
>
> int main()
> {
>   int b = 3 / fun0(0);
>   return b;
> }

Yes: in that the 0 came from the source of the program, rather than
from an attacker, so it's not tainted. The analyzer doesn't have a
good way to attach state-machine state to a constant, only to other
kinds of symbolic value.

See
  gcc/testsuite/gcc.dg/analyzer/taint-divisor-1.c
  gcc/testsuite/gcc.dg/plugin/taint-antipatterns-1.c
for examples that ought to report tainted divisors (the former from
"fread", the latter from "copy_from_user" via a plugin).

>
> 5. I guess the project would mostly modify constraint-manager.h and
> sm-taint.cc . Or are there other files that you
> suspect relevant for this project?

I think region-model.{cc,h} is likely to be very relevant here, and
possibly program-state.{cc,h}; I think one of the challenges will be to
see to what extent, when we enable the taint state machine by default,
it bloats the program states (much of which is handled in class
region_model) to the point where the exploded_graph gets much bigger,
and we lose coverage compared to what we had before. I think we're
going to need to improve state purging so that e.g. if there's a
buffer containing tainted data that only gets used in one part of the
function, we can stop bothering to track its taintedness once it is no
longer relevant.

I suspect the project may be rather open-ended, in that it's a case of
turning the feature on, trying it on real-world C projects (as well as
just the regression testsuite), and seeing:
- to what extent it's useful, and
- to what extent it's spamming the user, and
- what breaks
and fixing the issues you encounter up to the point where it's
reasonable to enable the feature for GCC 14 (hopefully).

>
> 6. Is the current implementation based on some papers?

I confess I haven't read much in this space; I'm looking forward to
reading the papers you linked to.

> I found this
> (https://users.ece.cmu.edu/~aavgerin/papers/Oakland10.pdf) and this
> (https://www.ndss-symposium.org/wp-content/uploads/2017/09/Dynamic-Taint-Analysis-for-Automatic-Detection-Analysis-and-SignatureGeneration-of-Exploits-on-Commodity-Software-Dawn-Song.pdf),
> but haven’t started reading yet. In addition, purging states of the
> constraint manager sounds like a problem other people may have looked
> at. Is there any related progress since you documented in PR103533?
>
> As you said, this would be an open-ended project, so it would be very
> helpful to get some feedback from you so that I know how to draft my
> proposal.

(nods)

> In addition, is it ok to deviate from the proposal after I start
> working?

Yes: as noted above, much of the project would be to try turning it on
for real-world C code, seeing what breaks, and fixing that, so we can't
yet know what that will be. Depending on how hard the issues are,
"success" for the project could be "fixed all issues and enabled it in
trunk for GCC 14" vs "identified and wrote up a set of issues that need
resolving", or somewhere in between.

Hope this makes sense (and isn't too intimidating!)
Dave