Subject: Re: GSoC: Working on the static analyzer
From: David Malcolm
To: Mir Immad
Cc: gcc@gcc.gnu.org
Date: Wed, 26 Jan 2022 09:26:07 -0500

On Mon, 2022-01-24 at 01:41 +0530, Mir Immad wrote:
> Hi, sir.
>
> I've been trying to understand the static analyzer's code. I spent most
> of my time learning the state machine's API. I learned how state
> machine's on_stmt is supposed to "recognize" specific functions and how
> on_transition takes a specific tree from one state to another, and how
> the captured states are used by pending_diagnostics to report the
> errors. Furthermore, I was able to create a dummy checker that mimicked
> the behaviour of sm-file's double_fclose and compile GCC with these
> changes. Is this the right way of learning?

This sounds great.

>
> As you've mentioned on the projects page that you would like to add
> more support for some POSIX APIs.
> Can you please write (or refer me to) a simple C program that uses such
> an API (and also what the analyzer should have done) so that I can
> attempt to add such a checker to the analyzer.

A couple of project ideas:

(i) treat data coming from a network connection as tainted, by somehow
teaching the analyzer about networking APIs. Ideally: look at some
subset of historical CVEs involving network-facing attacks on user-space
daemons, and find a way to detect them in the analyzer (need to find a
way to mark the incoming data as tainted, so that the analyzer "knows"
about the trust boundary - that the incoming data needs to be sanitized
and treated with extra caution; see
https://gcc.gnu.org/pipermail/gcc-patches/2021-November/584372.html
for my attempts to do this for the Linux kernel). Obviously this is
potentially a huge project, so maybe just picking a tiny subset and
getting that working as a proof-of-concept would be a good GSoC project.
Maybe find an old CVE that someone has written a good write-up for, and
think about "how could GCC/-fanalyzer have spotted it?"

(ii) add leak-detection for POSIX file descriptors: i.e. the integer
values returned by "open", "dup", etc. It would be good to have a check
that the user's code doesn't leak these values e.g. on error-handling
paths, by failing to close a file-descriptor (and not storing it
anywhere). I think that much of this could be done by analogy with the
sm-file.cc code.

>
> Also, I didn't realize the complexity of adding SARIF when I mentioned
> it. I'd rather work on adding more checkers.

Fair enough.

Hope the above is constructive.
Dave

>
> Regards.
>
> Mir Immad
>
> On Sun, Jan 23, 2022, 11:04 PM Mir Immad wrote:
>
> > Hi Sir,
> >
> > I've been trying to understand the static analyzer's code. I spent
> > most of my time learning the state machine's API.
> > I learned how state machine's on_stmt is supposed to "recognize"
> > specific functions and takes a specific tree from one state to
> > another, and how the captured states are used by pending_diagnostics
> > to report the errors. Furthermore, I was able to create a dummy
> > checker that mimicked the behaviour of sm-file's double_fclose and
> > compile GCC with these changes. Is this the right way of learning?
> >
> > As you've mentioned on the projects page that you would like to add
> > more support for some POSIX APIs. Can you please write (or refer me
> > to) a simple C program that uses such an API (and also what the
> > analyzer should have done) so that I can attempt to add such a
> > checker to the analyzer.
> >
> > Also, I didn't realize the complexity of adding SARIF when I
> > mentioned it. I'd rather work on adding more checkers.
> >
> > Regards.
> > Mir Immad
> >
> > On Mon, Jan 17, 2022 at 5:41 AM David Malcolm wrote:
> >
> > > On Fri, 2022-01-14 at 22:15 +0530, Mir Immad wrote:
> > > > Hi David,
> > > > I've been tinkering with the static analyzer for the last few
> > > > days. I find the project of adding SARIF output to the analyzer
> > > > interesting. I'm writing this to let you know that I'm trying to
> > > > learn the codebase.
> > > > Thank you.
> > >
> > > Excellent.
> > >
> > > BTW, I think adding SARIF output would involve working more with
> > > GCC's diagnostics subsystem than with the static analyzer, since
> > > (in theory) all of the static analyzer's output is passing through
> > > the diagnostics subsystem - though the static analyzer is probably
> > > the only GCC component generating diagnostic paths.
> > >
> > > I'm happy to mentor such a project as I maintain both subsystems
> > > and SARIF output would benefit both - but it would be rather
> > > tangential to the analyzer - so if you had specifically wanted to
> > > be working on the guts of the analyzer itself, you may want to pick
> > > a different subproject.
> > >
> > > The SARIF standard is rather long and complicated, and we would
> > > want to be compatible with clang's implementation.
> > >
> > > It would be very cool if gcc could also accept SARIF files as an
> > > *input* format, and emit them as diagnostics; that might help with
> > > debugging SARIF output. (I have an old patch for adding JSON
> > > parsing support to GCC that could be used as a starting point for
> > > this).
> > >
> > > Hope the above makes sense
> > > Dave
> > >
> > > > On Tue, Jan 11, 2022, 7:09 PM David Malcolm <dmalcolm@redhat.com>
> > > > wrote:
> > > >
> > > > > On Tue, 2022-01-11 at 11:03 +0530, Mir Immad via Gcc wrote:
> > > > > > Hi everyone,
> > > > >
> > > > > Hi, and welcome.
> > > > >
> > > > > > I intend to work on the static analyzer. Are these documents
> > > > > > enough to get started:
> > > > > > https://gcc.gnu.org/onlinedocs/gccint and
> > > > > > https://gcc.gnu.org/onlinedocs/gccint/Analyzer-Internals.html#Analyzer-Internals
> > > > >
> > > > > Yes.
> > > > >
> > > > > There are also some high-level notes here:
> > > > >   https://gcc.gnu.org/wiki/DavidMalcolm/StaticAnalyzer
> > > > >
> > > > > Also, given that the analyzer is part of GCC, the more general
> > > > > introductions to hacking on GCC will be useful.
> > > > >
> > > > > I recommend creating a trivial C source file with a bug in it
> > > > > (e.g. a 3-line function with a use-after-free), and stepping
> > > > > through the analyzer to get a sense of how it works.
> > > > >
> > > > > Hope this is helpful; don't hesitate to ask questions.
> > > > > Dave
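
Project idea (ii) in the reply at the top of this thread asks for a check
that POSIX file descriptors are not leaked on error-handling paths. As a
rough sketch of the kind of C test case such a checker might target (the
function names are hypothetical and chosen only for illustration; nothing
here is taken from GCC's sources or testsuite), the first function below
leaks "fd" when read() fails, and the second closes it on every path:

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical example: return the first byte of the file at PATH,
   or -1 on any error.  This version has a deliberate bug.  */
int
first_byte_leaky (const char *path)
{
  assert (path != NULL);

  int fd = open (path, O_RDONLY);
  if (fd < 0)
    return -1; /* Nothing was opened, so nothing leaks here.  */

  unsigned char c;
  if (read (fd, &c, 1) != 1)
    return -1; /* BUG: 'fd' is neither closed nor stored anywhere,
                  so the descriptor leaks on this error path.  */

  close (fd);
  return c;
}

/* Same contract as above, but the descriptor is closed on every
   path that follows a successful open().  */
int
first_byte_fixed (const char *path)
{
  assert (path != NULL);

  int fd = open (path, O_RDONLY);
  if (fd < 0)
    return -1;

  unsigned char c;
  int result = (read (fd, &c, 1) == 1) ? c : -1;

  close (fd);
  return result;
}
```

By analogy with the sm-file.cc approach mentioned in the reply, a checker
would presumably treat the result of open() as being in an "unclosed"
state and complain at the second "return -1" in first_byte_leaky, where
the only copy of the descriptor goes out of scope; first_byte_fixed
should produce no such complaint.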