From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=vm5e=DL=redhat.com=dmalcolm@sourceware.org>
Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124])
	by sourceware.org (Postfix) with ESMTPS id 17F603857710
	for <gcc@gcc.gnu.org>; Tue, 25 Jul 2023 14:41:27 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 17F603857710
Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=redhat.com
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=redhat.com
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com;
	s=mimecast20190719; t=1690296085;
	h=from:from:reply-to:subject:subject:date:date:message-id:message-id:
	 to:to:cc:mime-version:mime-version:content-type:content-type:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references;
	bh=aBY270CN08jkVHqjpMPAtxR0Kep9VQJndpGUkA7xXS8=;
	b=CBYbaDdN0XnDeVjzO3XcDjEJjtqjbbne0o6VRYdQRoV37gtmvnGXMWr0ZEPwG1f8fS7k9Y
	NEzqQNJOCNsmsq8h9FjJrvuP7N6o7DHYnIyX+ywPcVOmQB54Etly1vcgh35n0VVufnOm0z
	dSUM3D0CheQ2PmUfLDI31245DfIAl14=
Received: from mail-ot1-f69.google.com (mail-ot1-f69.google.com
 [209.85.210.69]) by relay.mimecast.com with ESMTP with STARTTLS
 (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id
 us-mta-422-lPZfDK9JPtCvUe0QLaG3gQ-1; Tue, 25 Jul 2023 10:41:24 -0400
X-MC-Unique: lPZfDK9JPtCvUe0QLaG3gQ-1
Received: by mail-ot1-f69.google.com with SMTP id 46e09a7af769-6bc59b0fff5so435268a34.0
        for <gcc@gcc.gnu.org>; Tue, 25 Jul 2023 07:41:24 -0700 (PDT)
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20221208; t=1690296083; x=1690900883;
        h=mime-version:user-agent:content-transfer-encoding:references
         :in-reply-to:date:to:from:subject:message-id:x-gm-message-state:from
         :to:cc:subject:date:message-id:reply-to;
        bh=aBY270CN08jkVHqjpMPAtxR0Kep9VQJndpGUkA7xXS8=;
        b=J1YeUsXrVz/JPOI+wqbxlvoRYRWf2xBi321j92EcLgrHA6CzQLve+oWEeCZGJZ5kdx
         gxA9g7Q4LbnNwV02kUGlXpvoZrqtkWyWxnsO4Gg7MXWIyZI0xZmU4UfLdIsVrCX5uc/T
         lROJMSWIkzKz+mfPeFejIaUNZGoJQxyA3VANm4oQhBpevNWrvwzu7n1tJ1ihiN3wDydx
         fexOroeqttG/ze0GM7sljAZCMC4USdkx3TQ1E2brguUpe89+LLtNRVHaa/G6trKpj1Ag
         Lrb6EoGeENQvxvkKvKcJMrVd7rp1jydwsSmBMDKWN6uqZKFjU7DR7Y7GUcD6uxcpMORq
         zKzw==
X-Gm-Message-State: ABy/qLarlQalKHP3AzDtzD65Bym6IyeV3Q1Dl7lLcG8WHM79JmuOjJAp
	dZRmGNc/bT7S72baZ/1wNR3bOEC3rU/dtARpVTxaC5QcQq23OjJbltSft0mXsTc97Mz3lIo7XLn
	gdiLwKmasZineSqM=
X-Received: by 2002:a9d:7344:0:b0:6aa:ecb5:f186 with SMTP id l4-20020a9d7344000000b006aaecb5f186mr11263231otk.7.1690296083406;
        Tue, 25 Jul 2023 07:41:23 -0700 (PDT)
X-Google-Smtp-Source: APBJJlGOP+7eqNEHIO6C3aVziykr3UdR58GVXfHgWM2fAJaZspyDCKribq0Aqfl48KxPlq4QAgUjXg==
X-Received: by 2002:a9d:7344:0:b0:6aa:ecb5:f186 with SMTP id l4-20020a9d7344000000b006aaecb5f186mr11263216otk.7.1690296083135;
        Tue, 25 Jul 2023 07:41:23 -0700 (PDT)
Received: from t14s.localdomain (c-76-28-97-5.hsd1.ma.comcast.net. [76.28.97.5])
        by smtp.gmail.com with ESMTPSA id ou26-20020a05620a621a00b00767b37256ecsm3703506qkn.107.2023.07.25.07.41.22
        (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
        Tue, 25 Jul 2023 07:41:22 -0700 (PDT)
Message-ID: <c03cd66d5ca77e045cf3d1bb0f81566eb60c868e.camel@redhat.com>
Subject: Re: Update and Questions on CPython Extension Module -fanalyzer
 plugin development
From: David Malcolm <dmalcolm@redhat.com>
To: Eric Feng <ef2648@columbia.edu>, gcc@gcc.gnu.org
Date: Tue, 25 Jul 2023 10:41:21 -0400
In-Reply-To: <CANGHATX0p3TX83ioNO7uzrJLZB4YYSLnDU3FbAfY1NyUi-77_A@mail.gmail.com>
References: <CANGHATX0p3TX83ioNO7uzrJLZB4YYSLnDU3FbAfY1NyUi-77_A@mail.gmail.com>
User-Agent: Evolution 3.44.4 (3.44.4-2.fc36)
MIME-Version: 1.0
X-Mimecast-Spam-Score: 0
X-Mimecast-Originator: redhat.com
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Spam-Status: No, score=-5.3 required=5.0 tests=BAYES_00,DKIMWL_WL_HIGH,DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE,RCVD_IN_MSPIKE_H4,RCVD_IN_MSPIKE_WL,SPF_HELO_NONE,SPF_NONE,TXREP,T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org
List-Id: <gcc.gcc.gnu.org>

On Tue, 2023-07-25 at 00:49 -0400, Eric Feng wrote:
> Hi all,

Hi Eric, thanks for the update.

Various comments inline below...

>=20
> I would like to update everyone on the progress of the static
> analyzer
> plugin for CPython extension module code.
>  Since the last update, I
> have implemented known function subclasses for PyList_New and
> PyList_Append. The existing known function subclasses have also been
> enhanced to provide more information. For instance, we are now
> simulating object type specific fields in addition to just ob_refcnt
> and ob_type, which are shared by all PyObjects.

Do you have any DejaGnu tests for this functionality?  For example,
given PyList_New
  https://docs.python.org/3/c-api/list.html#c.PyList_New
there could be a test like:

/* { dg-require-effective-target python_h } */

#define PY_SSIZE_T_CLEAN
#include <Python.h>
#include "analyzer-decls.h"

PyObject *
test_PyList_New (Py_ssize_t len)
{
  PyObject *obj =3D PyList_New (len);
  if (obj)
    {
     __analyzer_eval (obj->ob_refcnt =3D=3D 1); /* { dg-warning "TRUE" } */
     __analyzer_eval (PyList_Check (obj)); /* { dg-warning "TRUE" } */
     __analyzer_eval (PyList_CheckExact (obj)); /* { dg-warning "TRUE" } */
    }
  else
    __analyzer_dump_path (); /* { dg-warning "path" } */
  return obj;
}

...or similar, to verify that we simulate that the call can both
succeed and fail, and to verify properties of the store along the
"success" path.  Caveat: I didn't look at exactly what properties
you're simulating, so the above tests might need adjusting.

The:
  /* { dg-require-effective-target python_h } */
allows the testcase to bail out with "UNSUPPORTED" on hosts that don't
have a suitable Python.h header file installed, so that all this
Python-specific functionality is optional.  You could implement this
via a new function "check_effective_target_python_h" in
gcc/testsuite/lib/target-supports.exp, similar to the existing
check_effective_target_ functions in that Tcl file.


>=20
> Regarding reference count checking, I have implemented a naive
> traversal of the store to count the actual reference count of
> PyObjects, allowing us to compare it against the ob_refcnt fields of
> the same PyObjects. Although we can compare the actual reference count
> and the ob_refcnt field, I am still working on implementing a
> diagnostic to alert about this issue.

Sounds promising.

>=20
> In addition to the progress update, I have some implementation
> related
> questions and would appreciate any input. The current moment at which
> we run the algorithm for reference count checking, and thereby also
> the moment at which we may want to issue
> impl_region_model_context::warn, is within region_model::pop_frame.
> However, it appears that m_stmt and m_stmt_finder are NULL at the
> time
> of region_model::pop_frame, which results in the diagnostic for the
> reference count getting rejected. I am having trouble finding a
> workaround for this issue, so any ideas would be welcome.

FWIW I've run into a similar issue, and so did Tim last year.  The
whole stmt vs stmt_finder thing is rather ugly, and I have a local
branch with a part-written reworking of it that I hope will solve these
issues (adding a "class pending_location"), but it's very messy and far
from being ready.  Sorry about this.

In theory you can provide an instance of your own custom stmt_finder
subclass when you save the pending_diagnostic.  There's an example of
doing this in engine.cc: impl_region_model_context::on_state_leak,
where the leak is going to be reported at the end of a function
(similar to your diagnostic); it uses a leak_stmt_finder class, which
simply scans backwards once it has an exploded_path to find the last
stmt that had a useful location_t value.  This is a nasty hack, but
probably does what you need.

>=20
> I am also currently examining some issues related to state merging.

Note that you can use -fno-analyzer-state-merge to turn off the state
merging code, which can be very useful when debugging this kind of
thing.

> Let's consider the following example which lacks error checking:
>=20
> PyObject* foo() {
> =C2=A0=C2=A0=C2=A0 PyObject item =3D PyLong_FromLong(10);
> =C2=A0=C2=A0=C2=A0 PyObject list =3D PyList_New(5);
> =C2=A0=C2=A0=C2=A0 return list;
> }
>=20
> The states for when PyLong_FromLong fails and when PyLong_FromLong
> succeeds are merged before the call to PyObject* list =3D
> PyList_New(5).

Ideally we would emit a leak warning about the "success" case of
PyLong_FromLong here.  I think you're running into the problem of the
"store" part of the program_state being separate from the "malloc"
state machine part of program_state - I'm guessing that you're creating
a heap_allocated_region for the new python object, but the "malloc"
state machine isn't transitioning the pointer from "start" to "assumed-
non-null".  Such state machine states inhibit state-merging, and so
this might solve your state-merging problem.

I think we need a way to call malloc_state_machine::on_allocator_call
from outside of sm-malloc.cc.  See region_model::on_realloc_with_move
for an example of how to do something similar.


> I suspect this may be related to me not correctly handling behavior
> that arises due to the analyzer deterministically selecting the IDs
> for heap allocations. Since there is a heap allocation for PyList_New
> following PyLong_FromLong, the success and fail cases for
> PyLong_FromLong are merged. I believe this is so that in the scenario
> where PyLong_FromLong fails and PyList_New succeeds, the ID for the
> region allocated for PyList_New wouldn't be the same as the
> PyLong_FromLong success case. Whatever the cause, due to this state
> merge, the heap allocated region representing PyObject *item has all
> its fields set to UNKNOWN, making it impossible to perform the
> reference count checking functionality. I attempted to fix this by
> wrapping the svalue representing PyLongObject with
> get_or_create_unmergeable, but it didn't seem to help.=C2=A0

Strange; I would have thought that would have fixed it.  Can you post
the specific code you tried?

> However, this
> issue doesn't occur in all situations. For instance:
>=20
> PyObject* foo() {
> =C2=A0=C2=A0=C2=A0 PyObject item =3D PyLong_FromLong(10);
> =C2=A0=C2=A0=C2=A0 PyObject list =3D PyList_New(5);
> =C2=A0=C2=A0=C2=A0 PyList_Append(list, item);
> =C2=A0=C2=A0=C2=A0 return list;
> }
>=20
> The above scenario is simulated as expected. I will continue to
> search
> for a solution, but any suggestions would be highly appreciated.
> Thank
> you!

Thanks again for the update; hope the above is helpful
Dave