From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ej1-x630.google.com (mail-ej1-x630.google.com [IPv6:2a00:1450:4864:20::630]) by sourceware.org (Postfix) with ESMTPS id 1CCFB3857809 for ; Tue, 1 Feb 2022 14:58:27 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 1CCFB3857809 Received: by mail-ej1-x630.google.com with SMTP id j2so54398626ejk.6 for ; Tue, 01 Feb 2022 06:58:27 -0800 (PST) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=ATGHEFL7xVnkb+J5TrXmFIzYjWzZK2tRQTRTaUxAW34=; b=c5Imdq7EqBJVVJcHW3s82VR4hunB7vgCBaJW5xrhneK1HcHChbohcDYvkVs2ge5bhK 3btuQUSYwTm0M3xfOpViQPmFd95I8ObxhlqDvpsF7am22RKa7dMI07LoJ5rd4hXlGNcu lVLcOBLJzK8RgikAbxjS06xqb1mOajCKYdVu/inTwlD/EX5hKsDR/D42fhEWrvvFq3/Y hNjTG69lRZaM/fWf6dfXni4aD8r1JssYN4YAPJoqb3vd9cTDO+0of6n5mlbplOyu1PE7 2+UlUY77BpnEDUcwyLATscSHU4RXszGtEjKXJEEL5v9iDsKdSSIhc3xX4Re5RiyS/7oz n3Sw== X-Gm-Message-State: AOAM533PG/x75/exoEtcL5YNohAFYjx0cS3g0yxGWbt9CkIZ29ImuL6n YbdAPLTs74KRxEpPJ9xnyNIS35bNiSSmqeCVYhQ= X-Google-Smtp-Source: ABdhPJw3GMEBPykZTLlwpLUJq+v+PogYGBMyw3n0gy48nRPe3zh/It/RWJ3YKQhw1bDqO/fTi/FhUDR4hC3IqyvKfHY= X-Received: by 2002:a17:907:72c6:: with SMTP id du6mr21504728ejc.220.1643727505857; Tue, 01 Feb 2022 06:58:25 -0800 (PST) MIME-Version: 1.0 References: <4eec5fa69b9daedcec5361c2cc18df7f1ef397af.camel@redhat.com> <2cc1022a7d7bc2b00ceac2175ada29554d961bff.camel@redhat.com> In-Reply-To: <2cc1022a7d7bc2b00ceac2175ada29554d961bff.camel@redhat.com> From: Mir Immad Date: Tue, 1 Feb 2022 20:28:13 +0530 Message-ID: Subject: Re: GSoC: Working on the static analyzer To: David Malcolm Cc: gcc@gcc.gnu.org Content-Type: multipart/mixed; boundary="000000000000183e0705d6f6224e" X-Spam-Status: No, score=-0.8 required=5.0 tests=BAYES_00, DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, FREEMAIL_ENVFROM_END_DIGIT, FREEMAIL_FROM, HTML_MESSAGE, KAM_SHORT, RCVD_IN_DNSWL_NONE, SPF_HELO_NONE, SPF_PASS, TXREP, T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.4 X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on server2.sourceware.org X-Content-Filtered-By: Mailman/MimeDel 2.1.29 X-BeenThere: gcc@gcc.gnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Gcc mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 01 Feb 2022 14:58:30 -0000 --000000000000183e0705d6f6224e Content-Type: text/plain; charset="UTF-8" I worked around the leak detection and also handled the case where the fd is not saved. I wonder why sm-file hasn't implemented it yet. I'm attaching a text file with analyzer warnings. I'm still on gcc-11.2.0, will move to v12 next thing. > I wonder if it's worth checking for attempts to write to a fd that was > opened with O_RDONLY, or the converse? Yes, it does not make sense to warn for write on read-only closed fd or converse. But, it does make sense to warn for write operation on valid read-only fd. For this, we might have to analyze calls to the open() carefully -- get the literal value of the second argument to the call and do bit-mask decomposition on it to find if O_RDONLY or O_WRONLY are in it. We might have to maintain separate vectors for these and check for them in calls to write and read -- warn if write is being called on read-only fd and converse. How do we equate them? Does gimple store unique ids for each tree? or something like SAME_TREE? Just thinking out loud. > how much state does it make sense to track for a fd? Sorry, I didnt get it. Do you mean how may states I'm using? unchecked, closed, null and leak_by_not_saved (for the call-tree that does not have a lhs). > Also, at some point, we're going to have to handle "errno" - but given > that might be somewhat fiddly it's OK to defer that until you're more > familiar with the code I'm looking forward to figure it out. Thank you. On Sat, Jan 29, 2022 at 11:09 PM David Malcolm wrote: > On Sat, 2022-01-29 at 20:22 +0530, Mir Immad wrote: > > Thank you for the detailed information. > > > > I've been looking into the integer posix file descriptor APIs and I > > decided to write proof-of-concept checker for them. (not caring > > about > > errno). The checker tracks the fd returned by open(), warns if dup() > > is > > called with closed fd otherwise tracks the fd returned by dup(), it > > also > > warns if read() and write() functions were called on closed fd. I'm > > attaching a text file that lists some c sources and warnings by the > > static > > analyzer. I've used the diagnostic meta-data from sm-file. Is this > > something that could also be added to the analyzer? > > This looks great, and very promising as both new functionality for GCC > 13, and as a GSoC 2022 project. > > BTW, it looks like you're working with GCC 11, but the analyzer has > changed quite a bit on trunk for GCC 12, so it's worth trying to track > trunk. > > I wonder if it's worth checking for attempts to write to a fd that was > opened with O_RDONLY, or the converse? (I'm not sure, just thinking > aloud - how much state does it make sense to track for a fd?). > > Also, at some point, we're going to have to handle "errno" - but given > that might be somewhat fiddly it's OK to defer that until you're more > familiar with the code. > > > > > About the fd leak, that's the next thing I'll try to get working. > > Since > > you've mentioned that it could be a GSoC project, this is what I'm > > going to > > focus on. > > Excellent. > > Let me know (via this mailing list) if you have any questions. > > Thanks > Dave > > > > > Regards. > > > > > > > > On Wed, Jan 26, 2022 at 7:56 PM David Malcolm > > wrote: > > > > > On Mon, 2022-01-24 at 01:41 +0530, Mir Immad wrote: > > > > Hi, sir. > > > > > > > > I've been trying to understand the static analyzer's code. I > > > > spent most > > > > of > > > > my time learning the state machine's API. I learned how state > > > > machine's > > > > on_stmt is supposed to "recognize" specific functions and how > > > > on_transition > > > > takes a specific tree from one state to another, and how the > > > > captured > > > > states are used by pending_diagnostics to report the errors. > > > > Furthermore, I > > > > was able to create a dummy checker that mimicked the behaviour of > > > > sm- > > > > file's > > > > double_fclose and compile GCC with these changes. Is this the > > > > right way > > > > of > > > > learning? > > > > > > This sounds great. > > > > > > > > > > > As you've mentioned on the projects page that you would like to > > > > add > > > > more > > > > support for some POSIX APIs. Can you please write (or refer me to > > > > a) a > > > > simple C program that uses such an API (and also what the > > > > analyzer > > > > should > > > > have done) so that I can attempt to add such a checker to the > > > > analyzer. > > > > > > A couple of project ideas: > > > > > > (i) treat data coming from a network connection as tainted, by > > > somehow > > > teaching the analyzer about networking APIs. Ideally: look at some > > > subset of historical CVEs involving network-facing attacks on user- > > > space daemons, and find a way to detect them in the analyzer (need > > > to > > > find a way to mark the incoming data as tainted, so that the > > > analyer > > > "know" about the trust boundary - that the incoming data needs to > > > be > > > sanitized and treated with extra caution; see > > > https://gcc.gnu.org/pipermail/gcc-patches/2021-November/584372.html > > > for my attempts to do this for the Linux kernel). > > > > > > Obviously this is potentially a huge project, so maybe just picking > > > a > > > tiny subset and getting that working as a proof-of-concept would be > > > a > > > good GSoC project. Maybe find an old CVE that someone has written > > > a > > > good write-up for, and think about "how could GCC/-fanalyzer have > > > spotted it?" > > > > > > (ii) add leak-detection for POSIX file descriptors: i.e. the > > > integer > > > values returned by "open", "dup", etc. It would be good to have a > > > check that the user's code doesn't leak these values e.g. on error- > > > handling paths, by failing to close a file-descriptor (and not > > > storing > > > it anywhere). I think that much of this could be done by analogy > > > with > > > the sm-file.cc code. > > > > > > > > > > > > > > Also, I didn't realize the complexity of adding SARIF when I > > > > mentioned > > > > it. > > > > I'd rather work on adding more checkers. > > > > > > Fair enough. > > > > > > Hope this above is constructive. > > > > > > Dave > > > > > > > > > > > Regards. > > > > > > > > Mir Immad > > > > > > > > On Sun, Jan 23, 2022, 11:04 PM Mir Immad > > > > wrote: > > > > > > > > > Hi Sir, > > > > > > > > > > I've been trying to understand the static analyzer's code. I > > > > > spent > > > > > most of > > > > > my time learning the state machine's API. I learned how state > > > > > machine's > > > > > on_stmt is supposed to "recognize" specific functions and takes > > > > > a > > > > > specific > > > > > tree from one state to another, and how the captured states are > > > > > used > > > > > by > > > > > pending_diagnostics to report the errors. Furthermore, I was > > > > > able to > > > > > create > > > > > a dummy checker that mimicked the behaviour of sm-file's > > > > > double_fclose and > > > > > compile GCC with these changes. Is this the right way of > > > > > learning? > > > > > > > > > > As you've mentioned on the projects page that you would like to > > > > > add > > > > > more > > > > > support for some POSIX APIs. Can you please write (or refer me > > > > > to a) > > > > > a > > > > > simple C program that uses such an API (and also what the > > > > > analyzer > > > > > should > > > > > have done) so that I can attempt to add such a checker to the > > > > > analyzer. > > > > > > > > > > Also, I didn't realize the complexity of adding SARIF when I > > > > > mentioned it. > > > > > I'd rather work on adding more checkers. > > > > > > > > > > Regards. > > > > > Mir Immad > > > > > > > > > > On Mon, Jan 17, 2022 at 5:41 AM David Malcolm < > > > > > dmalcolm@redhat.com> > > > > > wrote: > > > > > > > > > > > On Fri, 2022-01-14 at 22:15 +0530, Mir Immad wrote: > > > > > > > HI David, > > > > > > > I've been tinkering with the static analyzer for the last > > > > > > > few > > > > > > > days. I > > > > > > > find > > > > > > > the project of adding SARIF output to the analyzer > > > > > > > intresting. > > > > > > > I'm > > > > > > > writing > > > > > > > this to let you know that I'm trying to learn the codebase. > > > > > > > Thank you. > > > > > > > > > > > > Excellent. > > > > > > > > > > > > BTW, I think adding SARIF output would involve working more > > > > > > with > > > > > > GCC's > > > > > > diagnostics subsystem than with the static analyzer, since > > > > > > (in > > > > > > theory) > > > > > > all of the static analyzer's output is passing through the > > > > > > diagnostics > > > > > > subsystem - though the static analyzer is probably the only > > > > > > GCC > > > > > > component generating diagnostic paths. > > > > > > > > > > > > I'm happy to mentor such a project as I maintain both > > > > > > subsystems > > > > > > and > > > > > > SARIF output would benefit both - but it would be rather > > > > > > tangential > > > > > > to > > > > > > the analyzer - so if you had specifically wanted to be > > > > > > working on > > > > > > the > > > > > > guts of the analyzer itself, you may want to pick a different > > > > > > subproject. > > > > > > > > > > > > The SARIF standard is rather long and complicated, and we > > > > > > would > > > > > > want to > > > > > > be compatible with clang's implementation. > > > > > > > > > > > > It would be very cool if gcc could also accept SARIF files as > > > > > > an > > > > > > *input* format, and emit them as diagnostics; that might help > > > > > > with > > > > > > debugging SARIF output. (I have a old patch for adding JSON > > > > > > parsing > > > > > > support to GCC that could be used as a starting point for > > > > > > this). > > > > > > > > > > > > Hope the above makes sense > > > > > > Dave > > > > > > > > > > > > > > > > > > > > On Tue, Jan 11, 2022, 7:09 PM David Malcolm < > > > > > > > dmalcolm@redhat.com> > > > > > > > wrote: > > > > > > > > > > > > > > > On Tue, 2022-01-11 at 11:03 +0530, Mir Immad via Gcc > > > > > > > > wrote: > > > > > > > > > Hi everyone, > > > > > > > > > > > > > > > > Hi, and welcome. > > > > > > > > > > > > > > > > > I intend to work on the static analyzer. Are these > > > > > > > > > documents > > > > > > > > > enough to > > > > > > > > > get > > > > > > > > > started: https://gcc.gnu.org/onlinedocs/gccint and > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > https://gcc.gnu.org/onlinedocs/gccint/Analyzer-Internals.html#Analyzer-Internals > > > > > > > > > > > > > > > > Yes. > > > > > > > > > > > > > > > > There are also some high-level notes here: > > > > > > > > https://gcc.gnu.org/wiki/DavidMalcolm/StaticAnalyzer > > > > > > > > > > > > > > > > Also, given that the analyzer is part of GCC, the more > > > > > > > > general > > > > > > > > introductions to hacking on GCC will be useful. > > > > > > > > > > > > > > > > I recommend creating a trivial C source file with a bug > > > > > > > > in it > > > > > > > > (e.g. > > > > > > > > a > > > > > > > > 3-line function with a use-after-free), and stepping > > > > > > > > through > > > > > > > > the > > > > > > > > analyzer to get a sense of how it works. > > > > > > > > > > > > > > > > Hope this is helpful; don't hesitate to ask questions. > > > > > > > > Dave > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > --000000000000183e0705d6f6224e Content-Type: text/plain; charset="UTF-8"; name="gcc-leak-output.txt" Content-Disposition: attachment; filename="gcc-leak-output.txt" Content-Transfer-Encoding: base64 Content-ID: X-Attachment-Id: f_kz48dn4y0 JCBjYXQgbGVhay5jCgojaW5jbHVkZSA8ZmNudGwuaD4KCnZvaWQgdGVzdCAoKQp7CiAgIGludCBm ZCA9IG9wZW4oInRlc3QiLCBPX1JET05MWSk7Cn0KCnZvaWQgbWFpbigpe30KCiQgZ2NjLTExLjIu MCBsZWFrLmMgLWZhbmFseXplcgoKbGVhay5jOiBJbiBmdW5jdGlvbiDigJh0ZXN04oCZOgpsZWFr LmM6NjoxOiB3YXJuaW5nOiBsZWFrIG9mIEZJTEUg4oCYZmTigJkgW0NXRS03NzVdIFstV2FuYWx5 emVyLWZpbGUtbGVha10KICAgIDYgfCB9CiAgICAgIHwgXgogIOKAmHRlc3TigJk6IGV2ZW50cyAx LTIKICAgIHwKICAgIHwgICAgNSB8ICAgIGludCBmZCA9IG9wZW4oInRlc3QiLCBPX1JET05MWSk7 CiAgICB8ICAgICAgfCAgICAgICAgICAgICBefn5+fn5+fn5+fn5+fn5+fn5+fn5+CiAgICB8ICAg ICAgfCAgICAgICAgICAgICB8CiAgICB8ICAgICAgfCAgICAgICAgICAgICAoMSkgb3BlbmVkIGhl cmUKICAgIHwgICAgNiB8IH0KICAgIHwgICAgICB8IH4gICAgICAgICAgICAKICAgIHwgICAgICB8 IHwKICAgIHwgICAgICB8ICgyKSBsZWFrcyBoZXJlCiAgICB8CgokIGNhdCBsZWFrLWJ5LW5vdC1z YXZpbmcuYwoKI2luY2x1ZGUgPGZjbnRsLmg+Cgp2b2lkIHRlc3QgKCkKewogICBvcGVuKCJ0ZXN0 IiwgT19SRE9OTFkpOwp9Cgp2b2lkIG1haW4oKXt9CgokIGdjYyBsZWFrLWJ5LW5vdC1zYXZpbmcu YyAtZmFuYWx5emVyCgpsZWFrLmM6IEluIGZ1bmN0aW9uIOKAmHRlc3TigJk6CmxlYWsuYzo1OjQ6 IHdhcm5pbmc6IGxlYWsgYnkgbm90IHNhdmluZyBmZCBbQ1dFLTc3NV0gWy1XYW5hbHl6ZXItZmls ZS1sZWFrXQogICAgNSB8ICAgIG9wZW4oInRlc3QiLCBPX1JET05MWSk7CiAgICAgIHwgICAgXn5+ fn5+fn5+fn5+fn5+fn5+fn5+fgogIOKAmHRlc3TigJk6IGV2ZW50IDEKICAgIHwKICAgIHwgICAg NSB8ICAgIG9wZW4oInRlc3QiLCBPX1JET05MWSk7CiAgICB8ICAgICAgfCAgICBefn5+fn5+fn5+ fn5+fn5+fn5+fn5+CiAgICB8ICAgICAgfCAgICB8CiAgICB8ICAgICAgfCAgICAoMSkgbGVhayBi eSBub3Qgc2F2aW5nIGZkCiAgICB8Cg== --000000000000183e0705d6f6224e--