From: Eric Feng
Date: Tue, 28 Mar 2023 08:08:03 -0400
Subject: Re: [GSoC] Interest and initial proposal for project on reimplementing cpychecker as -fanalyzer plugin
To: David Malcolm
Cc: gcc@gcc.gnu.org
My apologies. Forgot to CC the mailing list in my previous e-mail. Original reply below:

_______

Hi David,

Thank you for your feedback!

> Also, the Python community would continue to find static analysis of
> CPython extension modules useful, so it would be good to have the idea
> live on as a GCC plugin on top of -fanalyzer.

I hope so!

> You should try building GCC from source, and hack in a trivial warning
> that emits "hello world, I'm compiling function 'foo'". I wrote a
> guide to GCC for new contributors here that should get you started:
> https://gcc-newbies-guide.readthedocs.io/en/latest/
> This will help you get familiar with GCC's internals, and although the
> plan is to write a plugin, I expect that you'll run into places where a
> patch to GCC itself is more appropriate (bugs and missing
> functionality), so having your own debug build of GCC is a good idea.
>
> You should become familiar with CPython's extension and embedding API.
> See the excellent documentation here:
> https://docs.python.org/3/extending/extending.html
> It's probably a good exercise to write your own trivial CPython
> extension module.
>
> You can read the old cpychecker code inside the gcc-python-plugin
> repository, and I gave a couple of talks on it at PyCon a decade ago:

Sounds good.

> > Error-handling checking: Various checks for common errors such as
> > dereferencing a NULL value.
>
> Yes. This is already done by -fanalyzer, but we need some way for it
> to know the outcomes of specific functions: e.g. one common pattern is
> that API function "PyFoo_Bar" could either:
> (a) succeed, returning a PyObject * that the caller "owns" a reference
> to, or
> (b) fail, returning NULL, and setting an exception on the thread-local
> interpreter state object

Sounds good. In other words, the infrastructure for this check is already there, and our job is to retrofit CPython API-specific knowledge onto it. Please correct me if my understanding here is wrong.

> > Errors in exception-handling: Checks for situations in which
> > functions returning PyObject* that is NULL are not gracefully
> > handled.
>
> Yes; detection of this would automatically happen if we implemented
> known_function subclasses e.g. for the pattern above.

Sounds good. In the next iteration of this proposal I will merge this task into the previous one, since it will be handled as a side effect of implementing it.
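Just to check my understanding of the pattern, here is a minimal sketch of the kind of deliberately buggy extension module I would expect the plugin to flag once it knows the outcomes of these API calls (the module and function names are made up for illustration):

#define PY_SSIZE_T_CLEAN
#include <Python.h>

/* Deliberately buggy: exercises both the NULL-return pattern and the
   reference-counting pattern discussed above.  */
static PyObject *
make_list(PyObject *self, PyObject *args)
{
    PyObject *list = PyList_New(0);       /* can fail: NULL + exception set */
    PyObject *item = PyLong_FromLong(42); /* can also fail and return NULL */

    /* Bug 1: if list (or item) is NULL, this dereferences NULL instead
       of propagating the error to the caller.  */
    PyList_Append(list, item);

    /* Bug 2: PyList_Append() does not steal the reference to item, so
       the missing Py_DECREF(item) leaks a reference -- the
       malloc()-without-free() analogue.  */
    return list;
}

static PyMethodDef demo_methods[] = {
    { "make_list", make_list, METH_NOARGS, "Return [42]." },
    { NULL, NULL, 0, NULL }
};

static struct PyModuleDef demo_module = {
    PyModuleDef_HEAD_INIT, "demo", NULL, -1, demo_methods
};

PyMODINIT_FUNC
PyInit_demo(void)
{
    return PyModule_Create(&demo_module);
}

If I have this right, the first issue is the NULL-return/exception pattern you describe, and the second is the reference-count leak case from my proposal.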
> You don't mention testing. I'd expect part of the project to be the
> creation of a regression test suite, with each step adding test
> coverage for the features it adds. There are lots of test cases in the
> existing cpychecker test suite that you could reuse - though beware,
> the test harness there is very bad - I made multiple mistakes:
> - expecting "gold" outputs from test cases - specific stderr strings,
>   which make the tests very brittle
> - external scripts associated with .c files, to tell it how to invoke
>   the compiler, which make the tests a pain to maintain and extend.
>
> GCC's own test suite handles this much better using DejaGnu where:
> - we test for specific properties of interest in the behavior of each
>   test (rather than rigidly specifying everything about the behavior
>   of each test)
> - the tests are expressed as .c files with "magic" comments containing
>   directives for the test harness
>
> That said DejaGnu is implemented in Tcl, which is a pain to deal with;
> you could reuse DejaGnu, or maybe Python might be a better choice; I'm
> not sure.

You're right, I forgot to mention that in the initial draft; thank you for pointing it out. I agree with the bottom-up approach of building up a comprehensive regression test suite as each feature is added. As for what to implement the suite in, I'll explore DejaGnu/Tcl in more detail before making a more informed decision (a rough sketch of the style of test I have in mind is below, after my question about attributes).

> It might be good to add new attributes to CPython's headers so that the
> function declarations become self-descriptive about e.g. reference-
> counting semantics (in a way readable both to humans and to static
> analysis tools). If so, this part of the project would involve working
> with the CPython development community, perhaps writing a PEP:
> https://peps.python.org/pep-0001/
> Again, this would be an ambitious goal, probably better done after
> there's a working prototype.

That would be very exciting. However, I'm not sure I fully understand what you mean. Could you clarify by giving an example of what the new attributes you have in mind might look like and how they would help (for example, with respect to reference-counting semantics)?
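To make my question about the new attributes more concrete: is the idea something along the lines of the sketch below? The attribute names here are entirely invented on my part, just to illustrate my guess; nothing like them exists in GCC or CPython today.

/* Hypothetical, machine-readable annotations on API declarations,
   reusing the PyFoo_Bar example from earlier in the thread.  The
   attribute names are invented purely for illustration.  */
typedef struct _object PyObject;

/* "On success, returns a new reference that the caller owns; on
   failure, returns NULL with an exception set."  */
extern PyObject *PyFoo_Bar(PyObject *obj)
    __attribute__((cpython_returns_new_ref,
                   cpython_null_means_exception));

/* "Returns a borrowed reference; the caller must not Py_DECREF it."  */
extern PyObject *PyFoo_GetItem(PyObject *obj, int index)
    __attribute__((cpython_returns_borrowed_ref));

If that is roughly the direction, I can see how both the plugin and human readers could pick up reference-ownership semantics directly from the headers.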
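Also, coming back to the testing discussion above: from a quick look at GCC's existing DejaGnu-based tests, I imagine the plugin's tests looking roughly like the file below, with expectations encoded as "magic" comment directives rather than golden stderr output. The options and the expected warning text are placeholders on my part, not output the plugin produces today, and I am hand-waving the question of how the test suite would find Python.h.

/* { dg-do compile } */
/* { dg-options "-fanalyzer" } */

#define PY_SSIZE_T_CLEAN
#include <Python.h>

void
test_missing_decref(PyObject *list)
{
    PyObject *item = PyLong_FromLong(42); /* new reference */
    PyList_Append(list, item);            /* does not steal the reference */
} /* { dg-warning "leak" } */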
Incidentally, I forgot to mention in my previous email that I believe the 350-hour option is the more appropriate one for this project. Please let me know otherwise.

Best,
Eric

On Sun, Mar 26, 2023 at 11:58 AM David Malcolm wrote:
>
> On Sat, 2023-03-25 at 15:38 -0400, Eric Feng via Gcc wrote:
> > Hi GCC community,
> >
> > For GSoC, I am extremely interested in working on the selected
> > project idea with respect to extending the static analysis pass. In
> > particular, porting gcc-python-plugin's cpychecker to a plugin for
> > GCC -fanalyzer as described in
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107646.
>
> Hi Eric, welcome to the GCC community.
>
> I'm the author/maintainer of GCC's static analysis pass. I'm also the
> author of gcc-python-plugin and its erstwhile "cpychecker" code, so I'm
> pleased that you're interested in the project.
>
> I wrote gcc-python-plugin and cpychecker over a decade ago when I was
> focused on CPython development (before I switched to GCC development),
> but it's heavily bitrotted over the years, as I didn't have enough
> cycles to keep it compatible with changes in both GCC and CPython
> whilst working on GCC itself. In particular, the cpychecker code
> stopped working a number of GCC releases ago. However, the cpychecker
> code inspired much of my work on GCC's static analysis pass and on its
> diagnostics subsystem, so much of it now lives on in C++ form as core
> GCC functionality. Also, the Python community would continue to find
> static analysis of CPython extension modules useful, so it would be
> good to have the idea live on as a GCC plugin on top of -fanalyzer.
>
> > Please find an initial draft of my proposal below and let me know if
> > it is a reasonable starting point. Please also correct me if I am
> > misunderstanding any particular tasks, and let me know what areas I
> > should add more information for or what else I may do in preparation.
>
> Some ideas for familiarizing yourself with the problem space:
>
> You should try building GCC from source, and hack in a trivial warning
> that emits "hello world, I'm compiling function 'foo'". I wrote a
> guide to GCC for new contributors here that should get you started:
> https://gcc-newbies-guide.readthedocs.io/en/latest/
> This will help you get familiar with GCC's internals, and although the
> plan is to write a plugin, I expect that you'll run into places where a
> patch to GCC itself is more appropriate (bugs and missing
> functionality), so having your own debug build of GCC is a good idea.
>
> You should become familiar with CPython's extension and embedding API.
> See the excellent documentation here:
> https://docs.python.org/3/extending/extending.html
> It's probably a good exercise to write your own trivial CPython
> extension module.
>
> You can read the old cpychecker code inside the gcc-python-plugin
> repository, and I gave a couple of talks on it at PyCon a decade ago:
>
> PyCon 2012: "Static analysis of Python extension modules using GCC"
> https://pyvideo.org/pycon-us-2012/static-analysis-of-python-extension-modules-using.html
>
> PyCon 2013: "Death by a thousand leaks: what statically-analysing 370
> Python extensions looks like"
> https://pyvideo.org/pycon-us-2013/death-by-a-thousand-leaks-what-statically-analys.html
> https://www.youtube.com/watch?v=bblvGKzZfFI
>
> (sorry about all the "ums" and "errs"; it's fascinating and
> embarrassing to watch myself from 11 years ago on this, and see how
> much I've both forgotten and learned in the meantime. Revisiting this
> work, I'm ashamed to see that I was referring to the implementation as
> based on "abstract interpretation" (and e.g. absinterp.py), when I now
> realize it's actually based on symbolic execution (as is GCC's
> -fanalyzer)
>
> Also, this was during the transition era between Python 2 and Python 3,
> whereas now we only have to care about Python 3.
>
> There may be other caveats; I haven't fully rewatched those talks yet
> :-/
>
> Various comments inline below, throughout...
>
> >
> > _______
> >
> > Describe the project and clearly define its goals:
> > One pertinent use case of the gcc-python plugin is as a static
> > analysis tool for CPython extension modules.
>
> It might be more accurate to use the past tense when referring to the
> gcc-python plugin, alas.
>
> > The main goal is to help programmers writing extensions identify
> > common coding errors. Broadly, the goal of this project is to port
> > the functionalities of cpychecker to a -fanalyzer plugin.
>
> (nods)
>
> > Below is a brief description of the functionalities of the static
> > analysis tool which I will work on porting over to a -fanalyzer
> > plugin.
> > The structure of the objectives is taken from the gcc-python-plugin
> > documentation:
> >
> > Reference count checking: Manipulation of PyObjects is done via the
> > CPython API and in particular with respect to the objects' reference
> > counts. When the reference count belonging to an object drops to
> > zero, we should free all resources associated with it. This check
> > helps programmers identify problems with the reference count
> > associated with an object: for example, memory leaks from forgetting
> > to decrement the reference count of an object (analogous to malloc()
> > without a corresponding free()), or access to an object after its
> > reference count has dropped to zero (analogous to use after free()).
>
> (nods)
>
> > Error-handling checking: Various checks for common errors such as
> > dereferencing a NULL value.
>
> Yes. This is already done by -fanalyzer, but we need some way for it
> to know the outcomes of specific functions: e.g. one common pattern is
> that API function "PyFoo_Bar" could either:
> (a) succeed, returning a PyObject * that the caller "owns" a reference
> to, or
> (b) fail, returning NULL, and setting an exception on the thread-local
> interpreter state object
>
> > Errors in exception-handling: Checks for situations in which
> > functions returning PyObject* that is NULL are not gracefully
> > handled.
>
> Yes; detection of this would automatically happen if we implemented
> known_function subclasses e.g. for the pattern above.
>
> > Format string checking: Verify that arguments to various CPython
> > APIs which take format strings are correct.
>
> Have a look at:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107017
> ("RFE: support printf-style formatted functions in -fanalyzer") and:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100121
> ("RFE: plugin support for -Wformat via __attribute__((format()))")
>
> > Associating PyTypeObject instances with compile-time types: Verify
> > that the run-time type of a PyTypeObject matches its corresponding
> > compile-time type for inputs where both are provided.
>
> (nods)
>
> > Verification of PyMethodDef tables: Verify that the functions in
> > PyMethodDef tables match the calling convention of the ml_flags set.
>
> (nods)
>
> > I suspect a good starting point would be existing proof-of-concept
> > -fanalyzer plugins such as the CPython GIL example
> > (analyzer_gil_plugin). Please let me know of any additional pointers.
>
> Yes.
>
> There are also two examples of "teaching" the analyzer about the
> behavior of specific functions via subclassing known_function in:
> analyzer_known_fns_plugin.c
> and:
> analyzer_kernel_plugin.c
>
> > If there is time to spare, I think it is reasonable to extend the
> > capabilities of the original checker as well (more details in the
> > expected timeline below).
> >
> > Provide an expected timeline:
> > I suspect the first task will take the longest, since it is
> > relatively involved and also includes getting the initial
> > infrastructure of the plugin up. With the experience from the first
> > task, I hope the remaining tasks can be implemented faster.
> > Additionally, I understand that the timeline outlined below is still
> > vague; I wanted to check in with the community for feedback on
> > whether I am in the right ballpark before committing to more details.
> >
> > Week 1 - 7: Reference counting checking
> > Week 8: Error-handling checking
> > Week 9: Errors in exception handling
> > Week 10: Format string checking
> > Week 11: Verification of PyMethodDef tables
> > Week 12: I am planning for the last week to be a buffer in case any
> > of the above tasks take longer than initially expected. In case
> > everything goes smoothly and there is time to spare, I think it is
> > reasonable to spend the time extending the capabilities of the
> > original pass. Some ideas include extending the subset of the CPython
> > API that cpychecker currently supports, matching up similar traces to
> > solve the issue of duplicate error reports, and/or addressing any of
> > the other caveats currently mentioned in the cpychecker
> > documentation. Additional ideas are welcome, of course.
>
> FWIW I think it's a very ambitious project, but you seem capable.
>
> You don't mention testing. I'd expect part of the project to be the
> creation of a regression test suite, with each step adding test
> coverage for the features it adds. There are lots of test cases in the
> existing cpychecker test suite that you could reuse - though beware,
> the test harness there is very bad - I made multiple mistakes:
> - expecting "gold" outputs from test cases - specific stderr strings,
>   which make the tests very brittle
> - external scripts associated with .c files, to tell it how to invoke
>   the compiler, which make the tests a pain to maintain and extend.
>
> GCC's own test suite handles this much better using DejaGnu where:
> - we test for specific properties of interest in the behavior of each
>   test (rather than rigidly specifying everything about the behavior
>   of each test)
> - the tests are expressed as .c files with "magic" comments containing
>   directives for the test harness
>
> That said DejaGnu is implemented in Tcl, which is a pain to deal with;
> you could reuse DejaGnu, or maybe Python might be a better choice; I'm
> not sure.
>
> It might be good to add new attributes to CPython's headers so that the
> function declarations become self-descriptive about e.g. reference-
> counting semantics (in a way readable both to humans and to static
> analysis tools). If so, this part of the project would involve working
> with the CPython development community, perhaps writing a PEP:
> https://peps.python.org/pep-0001/
> Again, this would be an ambitious goal, probably better done after
> there's a working prototype.
>
> > Briefly introduce yourself and your skills and/or accomplishments:
> > I am a current Master's in Computer Science student at Columbia
> > University. I did my undergraduate degrees at Bates College (B.A.
> > Math) and Columbia University (B.S. Computer Science). My interests
> > are primarily in systems, programming languages, and compilers.
> >
> > At Columbia, I work in the group led by Professor Stephen Edwards on
> > the SSLANG programming language: a language built atop the Sparse
> > Synchronous Model. For more formal information on the Sparse
> > Synchronous Model, please take a look at "The Sparse Synchronous
> > Model on Real Hardware" (2022). Please find our repo, along with my
> > contributions, here: https://github.com/ssm-lang/sslang (my GitHub
> > handle is @efric). My main contribution to the compiler so far has
> > been adding a static inlining optimization pass with another SSLANG
> > team member.
> > Our implementation is mostly based on the work shown in "Secrets of
> > the Glasgow Haskell Compiler Inliner" (2002), with modifications as
> > necessary to better fit our goals. The current implementation is a
> > work in progress and we are still working on adding (many) more
> > features to it. My work in this project is written in Haskell.
> >
> > I also conduct research in the Columbia Systems Lab. Specifically, my
> > group and I, advised by Professor Jason Nieh, work on secure
> > containerization with respect to untrusted software systems. Armv9-A
> > introduced Realms, secure execution environments that are opaque to
> > untrusted operating systems, as part of the Arm Confidential Compute
> > Architecture (CCA). Please find more information on CCA in "Design
> > and Verification of the Arm Confidential Compute Architecture"
> > (2022). Introduced alongside it was the Realm Management Monitor
> > (RMM), an interface that allows hypervisors to provide secure
> > virtualization using Realms and the new hardware support. Currently,
> > the Realm isolation boundary is at the level of entire VMs; we are
> > working on applying Realms to secure containers. Work in this project
> > is mostly at the kernel and firmware level and is written in C and
> > ARM assembly.
> >
> > Regarding experience with compilers, in addition to SSLANG, my
> > undergraduate education included a class on compilers that involved
> > writing passes for Clang/LLVM. Currently, I am taking a
> > graduate-level class on Types, Languages, and Compilers, where my
> > partner and I are working on a plugin for our own small toy-language
> > compiler which would be able to perform type inference. The plugin
> > would generate the relevant constraints and solve them on behalf of
> > the compiler. This project is still in its early stages, but the idea
> > is to delegate type-inference functionality to a generic library,
> > given some information, instead of having to write your own
> > constraint solver.
>
> It sounds like you may know more about the theory than I do!
>
> > Thank you for reviewing my proposal!
>
> Thanks for sending it; hope the above is helpful (and not too
> intimidating!)
>
> Dave