Message-ID: <9698600391b2cb611dfa8fee5540258ed0cafb1e.camel@redhat.com>
Subject: Re: [GSoC] Interest and initial proposal for project on reimplementing cpychecker as -fanalyzer plugin
From: David Malcolm
To: Eric Feng, gcc@gcc.gnu.org
Date: Sun, 26 Mar 2023 11:58:00 -0400

On Sat, 2023-03-25 at 15:38 -0400,
Eric Feng via Gcc wrote:
> Hi GCC community,
>
> For GSoC, I am extremely interested in working on the selected project
> idea with respect to extending the static analysis pass. In
> particular, porting gcc-python-plugin's cpychecker to a plugin for GCC
> -fanalyzer as described in
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107646.

Hi Eric, welcome to the GCC community. I'm the author/maintainer of
GCC's static analysis pass. I'm also the author of gcc-python-plugin
and its erstwhile "cpychecker" code, so I'm pleased that you're
interested in the project.

I wrote gcc-python-plugin and cpychecker over a decade ago when I was
focused on CPython development (before I switched to GCC development),
but it's heavily bitrotted over the years, as I didn't have enough
cycles to keep it compatible with changes in both GCC and CPython
whilst working on GCC itself. In particular, the cpychecker code
stopped working a number of GCC releases ago. However, the cpychecker
code inspired much of my work on GCC's static analysis pass and on its
diagnostics subsystem, so much of it now lives on in C++ form as core
GCC functionality. Also, the Python community would continue to find
static analysis of CPython extension modules useful, so it would be
good to have the idea live on as a GCC plugin on top of -fanalyzer.

> Please find an
> initial draft of my proposal below and let me know if it is a
> reasonable starting point. Please also correct me if I am
> misunderstanding any particular tasks and let me know what areas I
> should add more information for or what else I may do in preparation.

Some ideas for familiarizing yourself with the problem space:

You should try building GCC from source, and hack in a trivial warning
that emits "hello world, I'm compiling function 'foo'".
I wrote a guide to GCC for new contributors here that should get you
started:
  https://gcc-newbies-guide.readthedocs.io/en/latest/
This will help you get familiar with GCC's internals, and although the
plan is to write a plugin, I expect that you'll run into places where a
patch to GCC itself is more appropriate (bugs and missing
functionality), so having your own debug build of GCC is a good idea.

You should become familiar with CPython's extension and embedding API.
See the excellent documentation here:
  https://docs.python.org/3/extending/extending.html
It's probably a good exercise to write your own trivial CPython
extension module.

You can read the old cpychecker code inside the gcc-python-plugin
repository, and I gave a couple of talks on it at PyCon a decade ago:

PyCon2012: "Static analysis of Python extension modules using GCC"
https://pyvideo.org/pycon-us-2012/static-analysis-of-python-extension-modules-using.html

PyCon2013: "Death by a thousand leaks: what statically-analysing 370
Python extensions looks like"
https://pyvideo.org/pycon-us-2013/death-by-a-thousand-leaks-what-statically-analys.html
https://www.youtube.com/watch?v=bblvGKzZfFI

(sorry about all the "ums" and "errs"; it's fascinating and
embarrassing to watch myself from 11 years ago on this, and see how
much I've both forgotten and learned in the meantime)

Revisiting this work, I'm ashamed to see that I was referring to the
implementation as based on "abstract interpretation" (and e.g.
absinterp.py), when I now realize it's actually based on symbolic
execution (as is GCC's -fanalyzer).

Also, this was during the transition era between Python 2 and Python 3,
whereas now we only have to care about Python 3.

There may be other caveats; I haven't fully rewatched those talks yet
:-/

Various comments inline below, throughout...
>
> _______
>
> Describe the project and clearly define its goals:
> One pertinent use case of the gcc-python plugin is as a static
> analysis tool for CPython extension modules.

It might be more accurate to use the past tense when referring to the
gcc-python plugin, alas.

> The main goal is to help
> programmers writing extensions identify common coding errors. Broadly,
> the goal of this project is to port the functionalities of cpychecker
> to a -fanalyzer plugin.

(nods)

>
> Below is a brief description of the functionalities of the static
> analysis tool for which I will work on porting over to a -fanalyzer
> plugin. The structure of the objectives is taken from the
> gcc-python-plugin documentation:
>
> Reference count checking: Manipulation of PyObjects is done via the
> CPython API and in particular with respect to the objects' reference
> count. When the reference count belonging to an object drops to zero,
> we should free all resources associated with it. This check helps
> ensure programmers identify problems with the reference count
> associated with an object. For example, memory leaks with respect to
> forgetting to decrement the reference count of an object (analogous to
> malloc() without corresponding free()) or perhaps object access after
> the object's reference count is zero (analogous to access after
> free()).

(nods)

>
> Error-handling checking: Various checks for common errors such as
> dereferencing a NULL value.

Yes. This is already done by -fanalyzer, but we need some way for it
to know the outcomes of specific functions: e.g. one common pattern is
that API function "PyFoo_Bar" could either:
(a) succeed, returning a PyObject * that the caller "owns" a reference
to, or
(b) fail, returning NULL, and setting an exception on the thread-local
interpreter state object

>
> Errors in exception-handling: Checks for situations in which functions
> returning PyObject* that is NULL are not gracefully handled.
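To make the two-outcome pattern (a)/(b) concrete, here is a minimal
sketch in plain C. It deliberately uses a toy stand-in rather than the
real CPython headers: toy_object, toy_pending_error, toy_long_from_long,
toy_decref and toy_double are all hypothetical names invented for
illustration, not CPython or GCC APIs.

```c
#include <stdlib.h>

/* Toy stand-in for PyObject; NOT the real CPython type. */
typedef struct toy_object {
    long refcnt;   /* number of owned references */
    long value;
} toy_object;

/* Toy stand-in for the "exception set" state on the interpreter. */
static const char *toy_pending_error = NULL;

/* Hypothetical API function with two outcomes:
   (a) success: returns a new reference that the caller owns;
   (b) failure: returns NULL and records an error.  */
toy_object *toy_long_from_long(long value, int simulate_failure)
{
    toy_object *obj = simulate_failure ? NULL : malloc(sizeof(*obj));
    if (obj == NULL) {
        toy_pending_error = "out of memory";
        return NULL;
    }
    obj->refcnt = 1;
    obj->value = value;
    return obj;
}

/* Drop one reference, freeing the object when the count hits zero.
   Returns the remaining count.  */
long toy_decref(toy_object *obj)
{
    long remaining = --obj->refcnt;
    if (remaining == 0)
        free(obj);
    return remaining;
}

/* A caller that handles both outcomes.  Omitting the NULL check would
   be a NULL dereference on path (b); omitting toy_decref would leak
   the reference obtained on path (a).  */
long toy_double(long value, int simulate_failure)
{
    toy_object *obj = toy_long_from_long(value, simulate_failure);
    if (obj == NULL)
        return -1;              /* propagate failure; error stays set */
    long result = obj->value * 2;
    toy_decref(obj);            /* release the reference we own */
    return result;
}
```

A checker that knows toy_long_from_long has exactly these two outcomes
can explore both paths through toy_double and complain if either the
NULL check or the release is missing.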
Yes; detection of this would automatically happen if we implemented
known_function subclasses e.g. for the pattern above.

>
> Format string checking: Verify that arguments to various CPython APIs
> which take format strings are correct.

Have a look at:
  https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107017
("RFE: support printf-style formatted functions in -fanalyzer")
and:
  https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100121
("RFE: plugin support for -Wformat via __attribute__((format()))")

>
> Associating PyTypeObject instances with compile-time-types: Verify
> that the run-time type of a PyTypeObject matches its corresponding
> compile-time type for inputs where both are provided.

(nods)

>
> Verification of PyMethodDef tables: Verify that the function in
> PyMethodDef tables matches the calling convention of the ml_flags
> set.

(nods)

>
> I suspect a good starting point would be existing proof-of-concept
> -fanalyzer plugins such as the CPython GIL example
> (analyzer_gil_plugin). Please let me know of any additional pointers.

Yes. There are also two examples of "teaching" the analyzer about the
behavior of specific functions via subclassing known_function in:
  analyzer_known_fns_plugin.c
and:
  analyzer_kernel_plugin.c

> If there is time to spare, I think it is reasonable to extend the
> capabilities of the original checker as well (more details in the
> expected timeline below).
>
> Provide an expected timeline:
> I suspect the first task to take the longest since it is relatively
> involved and it also includes getting the initial infrastructure of
> the plugin up. From the experience of the first task, I hope the rest
> of the tasks would be implemented faster. Additionally, I understand
> that the current timeline outline below is too vague. I wished to
> check in with the community for some feedback on whether I am in the
> right ballpark before committing to more details.
>
> Week 1 - 7: Reference counting checking
> Week 8: Error-handling checking
> Week 9: Errors in exception handling
> Week 10: Format string checking
> Week 11: Verification of PyMethodDef tables
> Week 12: I am planning the last week to be safety in case any of the
> above tasks take longer than initially expected. In case everything
> goes smoothly and there is time to spare, I think it is reasonable to
> spend the time extending the capabilities of the original pass. Some
> ideas include extending the subset of CPython API that cpychecker
> currently support, matching up similar traces to solve the issue of
> duplicate error reports, and/or addressing any of the other caveats
> currently mentioned in the cpychecker documentation. Additional ideas
> are welcome of course.

FWIW I think it's a very ambitious project, but you seem capable.

You don't mention testing. I'd expect part of the project to be the
creation of a regression test suite, with each step adding test
coverage for the features it adds.

There are lots of test cases in the existing cpychecker test suite that
you could reuse - though beware, the test harness there is very bad - I
made multiple mistakes:
- expecting "gold" outputs from test cases (specific stderr strings),
  which make the tests very brittle
- external scripts associated with .c files, to tell it how to invoke
  the compiler, which make the tests a pain to maintain and extend.

GCC's own test suite handles this much better using DejaGnu where:
- we test for specific properties of interest in the behavior of each
  test (rather than rigidly specifying everything about the behavior of
  each test)
- the tests are expressed as .c files with "magic" comments containing
  directives for the test harness

That said, DejaGnu is implemented in Tcl, which is a pain to deal with;
you could reuse DejaGnu, or maybe Python might be a better choice; I'm
not sure.
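As a rough illustration of the "magic comments" style, a DejaGnu-style
test is an ordinary .c file whose comments carry the directives. The
sketch below is not taken from GCC's test suite or from cpychecker:
the function make_counter is a hypothetical name, and the message text
in the dg-warning directive is what I'd expect from the analyzer's
existing possibly-NULL dereference warning, so treat the exact wording
as an assumption.

```c
/* { dg-do compile } */
/* { dg-options "-fanalyzer" } */

#include <stdlib.h>

/* Deliberately buggy: the result of malloc is used without checking
   it for NULL.  The directive on the offending line tells the test
   harness which diagnostic is expected there, and nothing else about
   the compiler's output is pinned down.  */
int *make_counter(void)
{
    int *p = malloc(sizeof(int));
    *p = 0;   /* { dg-warning "dereference of possibly-NULL 'p'" } */
    return p;
}
```

Because the expectation sits on the line it refers to, the test keeps
passing when unrelated output changes, which is the property the "gold"
stderr files lacked.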
It might be good to add new attributes to CPython's headers so that the
function declarations become self-descriptive about e.g.
reference-counting semantics (in a way readable both to humans and to
static analysis tools). If so, this part of the project would involve
working with the CPython development community, perhaps writing a PEP:
  https://peps.python.org/pep-0001/
Again, this would be an ambitious goal, probably better done after
there's a working prototype.

>
> Briefly introduce yourself and your skills and/or accomplishments:
> I am a current Masters in Computer Science student at Columbia
> University. I did my undergraduates at Bates College (B.A Math) and
> Columbia University (B.S Computer Science) respectively. My interests
> are primarily in systems, programming languages, and compilers.
>
> At Columbia, I work in the group led by Professor Stephen Edwards on
> the SSLANG programming language: a language built atop the Sparse
> Synchronous Model. For more formal information on the Sparse
> Synchronous Model, please take a look at "The Sparse Synchronous Model
> on Real Hardware" (2022). Please find our repo, along with my
> contributions, here: https://github.com/ssm-lang/sslang (my GitHub
> handle is @efric). My main contribution to the compiler so far
> involved adding a static inlining optimization pass with another
> SSLANG team member. Our implementation is mostly based on the work
> shown in "Secrets of the Glasgow Haskell Compiler Inliner" (2002),
> with modifications as necessary to better fit our goals. The current
> implementation is a work in progress and we are still working on
> adding (many) more features to it. My work in this project is written
> in Haskell.
>
> I also conduct research in the Columbia Systems Lab. Specifically, my
> group and I, advised by Professor Jason Nieh, work on secure
> containerization with respect to untrusted software systems.
> Armv9-A introduced Realms, secure execution environments that are
> opaque to untrusted operating systems, as part of the Arm Confidential
> Compute Architecture (CCA). Please find more information on CCA in
> "Design and Verification of the Arm Confidential Compute Architecture"
> (2022). Introduced together was the Realm Management Monitor (RMM), an
> interface for hypervisors to allow secure virtualization utilizing
> Realms and the new hardware support. Currently, the Realm isolation
> boundary is at the level of entire VMs. We are working on applying
> Realms to secure containers. Work in this project is mostly at the
> kernel and firmware level and is written in C and ARM assembly.
>
> Pertaining experience with compilers in addition to SSLANG, my
> undergraduate education included a class on compilers that involved
> writing passes for Clang/LLVM. More currently, I am taking a
> graduate-level class on Types, Languages, and Compilers where my
> partner and I are working on a plugin for our own small toy language
> compiler which would be able to perform type inference. The plugin
> would generate relevant constraints and solve them on behalf of the
> compiler. This project is still in its early stages, but the idea is
> to delegate type inference functionalities to a generic library given
> some information instead of having to write your own constraint
> solver.

It sounds like you may know more about the theory than I do!

>
> Thank you for reviewing my proposal!

Thanks for sending it; hope the above is helpful (and not too
intimidating!)

Dave