From mboxrd@z Thu Jan  1 00:00:00 1970
Message-ID: <9698600391b2cb611dfa8fee5540258ed0cafb1e.camel@redhat.com>
Subject: Re: [GSoC] Interest and initial proposal for project on
 reimplementing cpychecker as -fanalyzer plugin
From: David Malcolm <dmalcolm@redhat.com>
To: Eric Feng <ef2648@columbia.edu>, gcc@gcc.gnu.org
Date: Sun, 26 Mar 2023 11:58:00 -0400
In-Reply-To: <CANGHATW9MARRSmMmrAr266LymWn8ERTCbs+Hh6sbFU+RR95_qA@mail.gmail.com>
References: <CANGHATW9MARRSmMmrAr266LymWn8ERTCbs+Hh6sbFU+RR95_qA@mail.gmail.com>
List-Id: <gcc.gcc.gnu.org>

On Sat, 2023-03-25 at 15:38 -0400, Eric Feng via Gcc wrote:
> Hi GCC community,
>
> For GSoC, I am extremely interested in working on the selected
> project
> idea with respect to extending the static analysis pass. In
> particular, porting gcc-python-plugin's cpychecker to a plugin for
> GCC
> -fanalyzer as described in
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107646.

Hi Eric, welcome to the GCC community.

I'm the author/maintainer of GCC's static analysis pass.  I'm also the
author of gcc-python-plugin and its erstwhile "cpychecker" code, so I'm
pleased that you're interested in the project.

I wrote gcc-python-plugin and cpychecker over a decade ago when I was
focused on CPython development (before I switched to GCC development),
but it's heavily bitrotted over the years, as I didn't have enough
cycles to keep it compatible with changes in both GCC and CPython
whilst working on GCC itself.  In particular, the cpychecker code
stopped working a number of GCC releases ago.  However, the cpychecker
code inspired much of my work on GCC's static analysis pass and on its
diagnostics subsystem, so much of it now lives on in C++ form as core
GCC functionality.  Also, the Python community would continue to find
static analysis of CPython extension modules useful, so it would be
good to have the idea live on as a GCC plugin on top of -fanalyzer.

>  Please find an
> initial draft of my proposal below and let me know if it is a
> reasonable starting point. Please also correct me if I am
> misunderstanding any particular tasks and let me know what areas I
> should add more information for or what else I may do in preparation.

Some ideas for familiarizing yourself with the problem space:

You should try building GCC from source, and hack in a trivial warning
that emits "hello world, I'm compiling function 'foo'".  I wrote a
guide to GCC for new contributors here that should get you started:
  https://gcc-newbies-guide.readthedocs.io/en/latest/
This will help you get familiar with GCC's internals, and although the
plan is to write a plugin, I expect that you'll run into places where a
patch to GCC itself is more appropriate (bugs and missing
functionality), so having your own debug build of GCC is a good idea.

You should become familiar with CPython's extension and embedding API.
See the excellent documentation here:
  https://docs.python.org/3/extending/extending.html
It's probably a good exercise to write your own trivial CPython
extension module.

You can read the old cpychecker code inside the gcc-python-plugin
repository, and I gave a couple of talks on it at PyCon a decade ago:

PyCon2012: "Static analysis of Python extension modules using GCC"
https://pyvideo.org/pycon-us-2012/static-analysis-of-python-extension-modules-using.html

PyCon2013: "Death by a thousand leaks: what statically-analysing 370
Python extensions looks like"
https://pyvideo.org/pycon-us-2013/death-by-a-thousand-leaks-what-statically-analys.html
https://www.youtube.com/watch?v=bblvGKzZfFI

(sorry about all the "ums" and "errs"; it's fascinating and
embarrassing to watch myself from 11 years ago on this, and see how
much I've both forgotten and learned in the meantime.  Revisiting this
work, I'm ashamed to see that I was referring to the implementation as
based on "abstract interpretation" (and e.g. absinterp.py), when I now
realize it's actually based on symbolic execution (as is GCC's
-fanalyzer).)

Also, this was during the transition era between Python 2 and Python 3,
whereas now we only have to care about Python 3.

There may be other caveats; I haven't fully rewatched those talks yet
:-/

Various comments inline below, throughout...

>
> _______
>
> Describe the project and clearly define its goals:
> One pertinent use case of the gcc-python plugin is as a static
> analysis tool for CPython extension modules.

It might be more accurate to use the past tense when referring to the
gcc-python plugin, alas.

>  The main goal is to help
> programmers writing extensions identify common coding errors.
> Broadly,
> the goal of this project is to port the functionalities of cpychecker
> to a -fanalyzer plugin.

(nods)

>
> Below is a brief description of the functionalities of the static
> analysis tool for which I will work on porting over to a -fanalyzer
> plugin. The structure of the objectives is taken from the
> gcc-python-plugin documentation:
>
> Reference count checking: Manipulation of PyObjects is done via the
> CPython API and in particular with respect to the objects' reference
> count. When the reference count belonging to an object drops to zero,
> we should free all resources associated with it. This check helps
> ensure programmers identify problems with the reference count
> associated with an object. For example, memory leaks with respect to
> forgetting to decrement the reference count of an object (analogous
> to
> malloc() without corresponding free()) or perhaps object access after
> the object's reference count is zero (analogous to access after
> free()).

(nods)
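
As a rough illustration of the two bug classes described above, here is a
minimal, self-contained sketch in plain C.  It does not use the real CPython
API: toy_object, toy_new, toy_incref, toy_decref and toy_live_objects are
hypothetical stand-ins for PyObject, an object constructor, Py_INCREF,
Py_DECREF and the allocator's bookkeeping.

```c
#include <stdlib.h>

/* Hypothetical stand-in for PyObject: a reference count plus a payload. */
typedef struct toy_object {
  long refcnt;
  int value;
} toy_object;

/* Number of toy objects currently allocated; used here to observe leaks. */
long toy_live_objects = 0;

/* Create an object.  The caller owns the single initial reference. */
toy_object *toy_new(int value)
{
  toy_object *obj = malloc(sizeof *obj);
  if (!obj)
    return NULL;
  obj->refcnt = 1;
  obj->value = value;
  toy_live_objects++;
  return obj;
}

/* Take an additional reference. */
void toy_incref(toy_object *obj)
{
  obj->refcnt++;
}

/* Drop a reference; free the object when the count reaches zero.
   Touching obj after the final toy_decref would be the analogue of
   a use-after-free.  */
void toy_decref(toy_object *obj)
{
  if (--obj->refcnt == 0) {
    toy_live_objects--;
    free(obj);
  }
}

/* Balanced: every reference taken is released.
   Returns the number of objects still alive afterwards.  */
long balanced_use(void)
{
  toy_object *obj = toy_new(42);
  if (!obj)
    return toy_live_objects;
  toy_incref(obj);
  toy_decref(obj);
  toy_decref(obj);   /* releases the initial reference; obj is freed */
  return toy_live_objects;
}

/* Buggy: the initial reference is never released, so the object leaks
   (the analogue of malloc() without a matching free()).  */
long leaky_use(void)
{
  toy_object *obj = toy_new(42);
  (void) obj;
  return toy_live_objects;
}
```

A reference-count checker would aim to accept balanced_use and to flag
leaky_use without having to run either of them.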
>
> Error-handling checking: Various checks for common errors such as
> dereferencing a NULL value.

Yes.  This is already done by -fanalyzer, but we need some way for it
to know the outcomes of specific functions: e.g. one common pattern is
that API function "PyFoo_Bar" could either:
(a) succeed, returning a PyObject * that the caller "owns" a reference
to, or
(b) fail, returning NULL, and setting an exception on the thread-local
interpreter state object
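
The two outcomes above can be sketched in plain C as follows.  This is only
a model under stated assumptions, not the CPython API: toy_foo_bar plays the
role of "PyFoo_Bar", toy_pending_error is a hypothetical stand-in for the
exception stored on the thread-local interpreter state, and the should_fail
parameter exists purely to force each path.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for PyObject. */
typedef struct toy_object {
  long refcnt;
} toy_object;

/* Hypothetical stand-in for the per-thread pending exception. */
_Thread_local const char *toy_pending_error = NULL;

/* Models "PyFoo_Bar":
   (a) success: returns an object the caller owns a reference to;
   (b) failure: returns NULL and records an error for this thread.  */
toy_object *toy_foo_bar(int should_fail)
{
  if (should_fail) {
    toy_pending_error = "toy_foo_bar failed";
    return NULL;
  }
  toy_object *obj = malloc(sizeof *obj);
  if (!obj) {
    toy_pending_error = "out of memory";
    return NULL;
  }
  obj->refcnt = 1;
  return obj;
}

/* A caller that handles both outcomes.
   Returns 0 on success, -1 on failure (leaving the error set).  */
int call_foo_bar(int should_fail)
{
  toy_object *obj = toy_foo_bar(should_fail);
  if (obj == NULL)
    return -1;              /* propagate; the error is already recorded */

  /* ... use obj here ... */

  if (--obj->refcnt == 0)   /* release the reference we own */
    free(obj);
  return 0;
}
```

If the analyzer knows that toy_foo_bar has exactly these two outcomes, a
caller that dereferences the result without the NULL check, or that returns
on the success path without releasing its reference, can be reported.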


>
> Errors in exception-handling: Checks for situations in which
> functions
> returning PyObject* that is NULL are not gracefully handled.

Yes; detection of this would automatically happen if we implemented
known_function subclasses e.g. for the pattern above.
>
> Format string checking: Verify that arguments to various CPython APIs
> which take format strings are correct.

Have a look at:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107017
("RFE: support printf-style formatted functions in -fanalyzer") and:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100121
("RFE: plugin support for -Wformat via __attribute__((format()))")
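
For background, this is a small sketch of what the existing format attribute
already does for printf-style functions.  The function log_message is a
hypothetical example, not part of GCC or CPython.  Note that CPython's own
format codes (as used by e.g. PyArg_ParseTuple) are a different language
from printf's, which is presumably why plugin support along the lines of the
second RFE would be needed.

```c
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/* Tell the compiler that argument 3 is a printf-style format string and
   that the values it consumes start at argument 4, so that -Wformat can
   check each call site against the format.  */
__attribute__((format(printf, 3, 4)))
int log_message(char *buf, size_t size, const char *fmt, ...)
{
  va_list ap;
  va_start(ap, fmt);
  int written = vsnprintf(buf, size, fmt, ap);
  va_end(ap);
  return written;
}
```

With this declaration, a call such as
log_message (buf, sizeof buf, "%d-%s", 7, "x") is accepted, while passing a
string where "%d" expects an int should draw a -Wformat warning at compile
time.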


>
> Associating PyTypeObject instances with compile-time-types: Verify
> that the run-time type of a PyTypeObject matches its corresponding
> compile-time type for inputs where both are provided.

(nods)

>
> Verification of PyMethodDef tables: Verify that the function in
> PyMethodDef tables matches the calling convention of the ml_flags
> set.

(nods)

>
> I suspect a good starting point would be existing proof-of-concept
> -fanalyzer plugins such as the CPython GIL example
> (analyzer_gil_plugin). Please let me know of any additional pointers.

Yes.

There are also two examples of "teaching" the analyzer about the
behavior of specific functions via subclassing known_function in:
  analyzer_known_fns_plugin.c
and:
  analyzer_kernel_plugin.c


> If there is time to spare, I think it is reasonable to extend the
> capabilities of the original checker as well (more details in the
> expected timeline below).
>
> Provide an expected timeline:
> I suspect the first task to take the longest since it is relatively
> involved and it also includes getting the initial infrastructure of
> the plugin up. From the experience of the first task, I hope the rest
> of the tasks would be implemented faster. Additionally, I understand
> that the current timeline outline below is too vague. I wished to
> check in with the community for some feedback on whether I am in the
> right ballpark before committing to more details.
>
> Week 1 - 7: Reference counting checking
> Week 8: Error-handling checking
> Week 9: Errors in exception handling
> Week 10: Format string checking
> Week 11: Verification of PyMethodDef tables
> Week 12: I am planning the last week to be safety in case any of the
> above tasks take longer than initially expected. In case everything
> goes smoothly and there is time to spare, I think it is reasonable to
> spend the time extending the capabilities of the original pass. Some
> ideas include extending the subset of CPython API that cpychecker
> currently support, matching up similar traces to solve the issue of
> duplicate error reports, and/or addressing any of the other caveats
> currently mentioned in the cpychecker documentation. Additional ideas
> are welcome of course.

FWIW I think it's a very ambitious project, but you seem capable.

You don't mention testing.  I'd expect part of the project to be the
creation of a regression test suite, with each step adding test
coverage for the features it adds.  There are lots of test cases in the
existing cpychecker test suite that you could reuse - though beware,
the test harness there is very bad - I made multiple mistakes:
- expecting "gold" outputs from test cases - specific stderr strings,
which make the tests very brittle
- external scripts associated with .c files, to tell it how to invoke
the compiler, which make the tests a pain to maintain and extend.

GCC's own test suite handles this much better using DejaGnu where:
- we test for specific properties of interest in the behavior of each
test (rather than rigidly specifying everything about the behavior of
each test)
- the tests are expressed as .c files with "magic" comments containing
directives for the test harness
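
To give a flavor of that style, here is a small sketch of such a test file.
The dg-do, dg-options, dg-warning and dg-bogus directives are standard
DejaGnu directives used in GCC's test suite; the function names, and the
choice to pass -fanalyzer explicitly via dg-options, are assumptions made
for this example.

```c
/* { dg-do compile } */
/* { dg-options "-fanalyzer" } */

#include <stdlib.h>

/* The harness checks only that a warning matching the given text is
   reported on the annotated line; the rest of the compiler's output is
   not compared against anything.  */
int test_double_free (void *ptr)
{
  free (ptr);
  free (ptr); /* { dg-warning "double-'free' of 'ptr'" } */
  return 0;
}

/* dg-bogus asserts the opposite: no such warning on this line.  */
int test_single_free (void *ptr)
{
  free (ptr); /* { dg-bogus "double-'free'" } */
  return 0;
}
```

Because each expectation sits next to the line it describes, adding a new
case means adding one function to a .c file, with no separate expected
output or driver script to keep in sync.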

That said, DejaGnu is implemented in Tcl, which is a pain to deal with;
you could reuse DejaGnu, or maybe Python might be a better choice; I'm
not sure.


It might be good to add new attributes to CPython's headers so that the
function declarations become self-descriptive about e.g.
reference-counting semantics (in a way readable both to humans and to static
analysis tools).  If so, this part of the project would involve working
with the CPython development community, perhaps writing a PEP:
  https://peps.python.org/pep-0001/
Again, this would be an ambitious goal, probably better done after
there's a working prototype.


>
> Briefly introduce yourself and your skills and/or accomplishments:
> I am a current Masters in Computer Science student at Columbia
> University. I did my undergraduates at Bates College (B.A Math) and
> Columbia University (B.S Computer Science) respectively. My interests
> are primarily in systems, programming languages, and compilers.
>
> At Columbia, I work in the group led by Professor Stephen Edwards on
> the SSLANG programming language: a language built atop the Sparse
> Synchronous Model. For more formal information on the Sparse
> Synchronous Model, please take a look at "The Sparse Synchronous
> Model
> on Real Hardware" (2022). Please find our repo, along with my
> contributions, here: https://github.com/ssm-lang/sslang (my GitHub
> handle is @efric). My main contribution to the compiler so far
> involved adding a static inlining optimization pass with another
> SSLANG team member. Our implementation is mostly based on the work
> shown in "Secrets of the Glasgow Haskell Compiler Inliner" (2002),
> with modifications as necessary to better fit our goals. The current
> implementation is a work in progress and we are still working on
> adding (many) more features to it. My work in this project is written
> in Haskell.
>
> I also conduct research in the Columbia Systems Lab. Specifically, my
> group and I, advised by Professor Jason Nieh, work on secure
> containerization with respect to untrusted software systems. Armv9-A
> introduced Realms, secure execution environments that are opaque to
> untrusted operating systems, as part of the Arm Confidential Compute
> Architecture (CCA). Please find more information on CCA in "Design
> and
> Verification of the Arm Confidential Compute Architecture" (2022).
> Introduced together was the Realm Management Monitor (RMM), an
> interface for hypervisors to allow secure virtualization utilizing
> Realms and the new hardware support. Currently, the Realm isolation
> boundary is at the level of entire VMs. We are working on applying
> Realms to secure containers. Work in this project is mostly at the
> kernel and firmware level and is written in C and ARM assembly.
>
> Pertaining experience with compilers in addition to SSLANG, my
> undergraduate education included a class on compilers that involved
> writing passes for Clang/LLVM. More currently, I am taking a
> graduate-level class on Types, Languages, and Compilers where my
> partner and I are working on a plugin for our own small toy language
> compiler which would be able to perform type inference. The plugin
> would generate relevant constraints and solve them on behalf of the
> compiler. This project is still in its early stages, but the idea is
> to delegate type inference functionalities to a generic library given
> some information instead of having to write your own constraint
> solver.

It sounds like you may know more about the theory than I do!

>
> Thank you for reviewing my proposal!

Thanks for sending it; hope the above is helpful (and not too
intimidating!)

Dave