Re: Update on CPython Extension Module -fanalyzer plugin development

public inbox for gcc@gcc.gnu.org
 help / color / mirror / Atom feed

From: David Malcolm <dmalcolm@redhat.com>
To: Eric Feng <ef2648@columbia.edu>
Cc: gcc@gcc.gnu.org
Subject: Re: Update on CPython Extension Module -fanalyzer plugin development
Date: Fri, 25 Aug 2023 15:50:09 -0400	[thread overview]
Message-ID: <c720fc39d86cb3ae8041fe5c86dad34d71466cec.camel@redhat.com> (raw)
In-Reply-To: <20230825125056.99826-1-ef2648@columbia.edu>

On Fri, 2023-08-25 at 08:50 -0400, Eric Feng wrote:
> Hi Dave,
> 
> Please find an updated WIP patch on reference count checking below. Some
> parts aren't properly formatted yet; I apologize for that.
> 
> Since the last WIP patch, the major updates include:
> - Updated certain areas of the core analyzer to support custom stmt_finder.
> - A significant revamp of the reference count checking logic.
> - __analyzer_cpython_dump_refcounts: This dumps reference count related information.

Thanks for the patch.

Given the scope, this is close to being ready for trunk; various
comments inline below...


> 
> Here's an updated look at the current WIP diagnostic we issue:
> rc3.c:25:10: warning: Expected <variable name belonging to m_base_region> to have reference count: ‘1’ but ob_refcnt field is: ‘2’
>    25 |   return list;
>       |          ^~~~
>   ‘create_py_object’: events 1-4
>     |
>     |    4 |   PyObject* item = PyLong_FromLong(3);
>     |      |                    ^~~~~~~~~~~~~~~~~~
>     |      |                    |
>     |      |                    (1) when ‘PyLong_FromLong’ succeeds
>     |    5 |   PyObject* list = PyList_New(1);
>     |      |                    ~~~~~~~~~~~~~
>     |      |                    |
>     |      |                    (2) when ‘PyList_New’ succeeds
>     |......
>     |   16 |   PyList_Append(list, item);
>     |      |   ~~~~~~~~~~~~~~~~~~~~~~~~~
>     |      |   |
>     |      |   (3) when ‘PyList_Append’ succeeds, moving buffer
>     |......
>     |   25 |   return list;
>     |      |          ~~~~
>     |      |          |
>     |      |          (4) here
>     |
> 

Looks good.

As mentioned in our chat, ideally there would be some kind of
information to the user as to what the reference actually are.

Perhaps it could be a series of notes like this:

note: there is 1 pointer to "*item"
note: (1) '*item' is pointed to by '((PyListObject *)list)->ob_item[0]'

Or, to be even more fancy, we could add events to the execution path
whenever the number of pointers or ob_refcnt changes, giving something
like:

(1) when ‘PyLong_FromLong’ succeeds
(2) '(*item).ob_refcnt' is now 1
(3) when ‘PyList_New’ succeeds
(4) '*item'  is now pointed to by '((PyListObject *)list)->ob_item[0]';
pointer count is 2, ob_refcnt is 1
(5) when 'create_py_object' returns, '(*item).ob_refcnt' is 1

...or somesuch (I'm brainstorming here, and I expect that the above
might not be implementable, due to local variables and SSA names
confusing things).

Plus there's the question of whether the return value "owns" a
reference. 

But all of that can be deferred to follow-up work.


> The reference count checking logic for v1 should be almost complete.
> Currently, it supports situations where the returned object is newly created.
> It doesn't yet support the other case (i.e., incremented by 1 from
> what it was previously, if not newly created). 
> 
> This week, I've focused primarily on the reference count checking logic. I plan
> to shift my focus to refining the diagnostic next. As seen in the placeholder
> diagnostic message above, I believe we should at least inform the user about the
> variable name associated with the region that has an unexpected reference count
> issue (in this case, 'item'). Initially, I suspect the issue might be that:
> 
> tree reg_tree = model->get_representative_tree (curr_region);
> 
> returns NULL since curr_region is heap allocated and thus the path_var returned would be:
> 
> path_var (NULL_TREE, 0);
> 
> Which means that:
> 
> refcnt_stmt_finder finder (*eg, reg_tree);
> 
> always receives a NULL_TREE, causing it to always default to the
> not_found case. A workaround might be necessary, but I haven't delved too deeply into this yet,
> so my suspicion could be off.
> 

You could try looking for a representative tree for the pointer, rather
than the region.

I ran into similar issues with leak detection: we report a leak when
nothing is pointing at the region, but if nothing is pointing at the
region, then there's nothing to label/describe the region with.  A
workaround I've used is to look at the old model immediately before the
state change, rather than the new one.

So for example, use the region_model * for immediately before popping
the frame (if it's available; you might need to make a copy?), and do
something like:

  tree expr = old_model->get_representative_tree (curr_region);



>  Additionally, I think it would be helpful to show users what the
> ob_refcnt looks like in each event as well.
> 

You could try make a custom_event subclass for such ob_refcnt messages,
but adding them could be fiddly.  Perhaps
diagnostic_manager::add_events_for_eedge could support some kind of
plugin hook whenever we add a statement_event, which you could use to
add your extra events whenever the ob_refcnt of the object of interest
differs between the old and the new state? (though currently we only
emit a statement_event for the first stmt in an enode, and that
function is already quite messy, alas)

But as I said before, that's definitely something to postpone to a
followup.

[...snip...]

> 
> @@ -620,8 +639,15 @@ class region_model_context
>  {
>   public:
>    /* Hook for clients to store pending diagnostics.
> -     Return true if the diagnostic was stored, or false if it was deleted.  */
> -  virtual bool warn (std::unique_ptr<pending_diagnostic> d) = 0;
> +     Return true if the diagnostic was stored, or false if it was deleted.
> +     Optionally provide a custom stmt_finder.  */
> +    virtual bool warn(std::unique_ptr<pending_diagnostic> d) {
> +        return warn(std::move(d), nullptr);
> +    }
> +    
> +    virtual bool warn(std::unique_ptr<pending_diagnostic> d, const stmt_finder *custom_finder) {
> +        return false;
> +    }

I found the pair of vfuncs here confusing.

Please can we have a single pure virtual "warn" vfunc, potentially with
an optional arg, so:

  virtual bool warn (std::unique_ptr<pending_diagnostic> d, 
                     const stmt_finder *custom_finder = nullptr)  = 0;

and add the new argument to the various implementations that override
"warn".

[...snip...]

>  /* A subclass of region_model_context for determining if operations fail
> @@ -912,6 +941,11 @@ class region_model_context_decorator : public region_model_context
>      return m_inner->get_stmt ();
>    }
>  
> +  const exploded_graph *get_eg () const override
> +  {
> +    return m_inner->get_eg ();

I recently changed region_model_context_decorator so that m_inner can
be null (see 1e7b0a5d7a45dc5932992d36e1375582d61269e4), so please can
you change this to:

const exploded_graph *get_eg () const override
{
  if (m_inner)
    return m_inner->get_eg ();
  else
    return nullptr;
}

[...snip...]

> 
> diff --git a/gcc/testsuite/gcc.dg/plugin/analyzer_cpython_plugin.c b/gcc/testsuite/gcc.dg/plugin/analyzer_cpython_plugin.c
> index 7cd72e8a886..a3274ced4a8 100644
> --- a/gcc/testsuite/gcc.dg/plugin/analyzer_cpython_plugin.c
> +++ b/gcc/testsuite/gcc.dg/plugin/analyzer_cpython_plugin.c

[...snip...]

> +class refcnt_mismatch : public pending_diagnostic_subclass<refcnt_mismatch>
> +{

[...snip...]

> +
> +  bool
> +  emit (rich_location *rich_loc, logger *) final override
> +  {
> +    diagnostic_metadata m;
> +    bool warned;
> +    // just assuming constants for now
> +    auto actual_refcnt
> +       = m_actual_refcnt->dyn_cast_constant_svalue ()->get_constant ();
> +    auto ob_refcnt = m_ob_refcnt->dyn_cast_constant_svalue ()->get_constant ();
> +    warned = warning_meta (
> +       rich_loc, m, get_controlling_option (),
> +       "Expected <variable name belonging to m_base_region> to have "
> +       "reference count: %qE but ob_refcnt field is: %qE",
> +       actual_refcnt, ob_refcnt);

This is OK for now.

Diagnostic messages should start with a lower-case letter.

Ideally we'd handle both symbolic and constant values.  We'd quote the
values if they're symbolic e.g.
  reference count: 2
but not quote them for constants, e.g.
  reference count: 'x + 1'

However doing so tends to lead to a combinatorial explosion here in the
possible messages, so maybe this could be split up as:

warning: reference count mismatch for PyObject * 'item'
note: 2 pointers found in memory to '*item'...
note: ...but '(*item).ob_refcnt' is 1

(where here both values are concrete)

But again, there's no need to fix this in this version of the code.

[...snip...]

> +/* Compare ob_refcnt field vs the actual reference count of a region */
> +void
> +check_refcnt (const region_model *model, region_model_context *ctxt,
> +             const hash_map<const ana::region *,
> +                            int>::iterator::reference_pair region_refcnt)

This function can be made "static" (as indeed can almost all of the
functions in the plugin, I expect, apart from "plugin_init"); these
aren't APIs being exposed to an external consumer.

> +{
> +  region_model_manager *mgr = model->get_manager ();
> +  const auto &curr_region = region_refcnt.first;
> +  const auto &actual_refcnt = region_refcnt.second;
> +  const svalue *ob_refcnt_sval = retrieve_ob_refcnt_sval (curr_region, model, ctxt);
> +  const svalue *actual_refcnt_sval = mgr->get_or_create_int_cst (
> +      ob_refcnt_sval->get_type (), actual_refcnt);
> +
> +  if (ob_refcnt_sval != actual_refcnt_sval)
> +  {
> +    // todo: fix this (always null)
> +    tree reg_tree = model->get_representative_tree (curr_region);
> +
> +    const auto &eg = ctxt->get_eg ();
> +    refcnt_stmt_finder finder (*eg, reg_tree);
> +    auto pd = make_unique<refcnt_mismatch> (curr_region, ob_refcnt_sval,
> +                                           actual_refcnt_sval, reg_tree);
> +    if (pd && eg)
> +    ctxt->warn (std::move (pd), &finder);

Note that IIRC the region_model code gets rerun with a NULL
region_model_context during feasibility checking, so we need to either
capture the invariant that ctxt is non-null here (with a gcc_assert),
and/or bail out early somewhere in all this checking for the null ctxt
case.

> +  }
> +}
> +
> +void
> +check_refcnts (const region_model *model, const svalue *retval,
> +           region_model_context *ctxt,
> +           hash_map<const region *, int> &region_to_refcnt)
> +{
> +  for (const auto &region_refcnt : region_to_refcnt)
> +  {
> +    check_refcnt(model, ctxt, region_refcnt);
> +  }
> +}
> +
> +/* Validates the reference count of all Python objects. */
> +void
> +pyobj_refcnt_checker (const region_model *model, const svalue *retval,
> +                   region_model_context *ctxt)
> +{
> +  if (!ctxt)
> +  return;

Aha: I see you have an early bailout here for the null ctxt, good.

> +
> +  auto region_to_refcnt = hash_map<const region *, int> ();
> +  auto seen_regions = hash_set<const region *> ();
> +
> +  count_expected_pyobj_references (model, region_to_refcnt, retval, seen_regions);
> +  check_refcnts (model, retval, ctxt, region_to_refcnt);
> +}
> +

[...snip...]

> +
> +void
> +dump_refcnt_info (const hash_map<const region *, int> &region_to_refcnt,
> +                 const region_model *model, region_model_context *ctxt)
> +{
> +  region_model_manager *mgr = model->get_manager ();
> +  pretty_printer pp;
> +  pp_format_decoder (&pp) = default_tree_printer;
> +  pp_show_color (&pp) = pp_show_color (global_dc->printer);
> +  pp.buffer->stream = stderr;
> +
> +  for (const auto &region_refcnt : region_to_refcnt)
> +  {

FWIW, the iteration order within a hash_map will vary from run to run
(e.g. precise pointer values are affected by address space layout
randomization), and so the ordering within the output to stderr will
vary from run to run.

It's OK for now, but it might be worth first building a vec, sorting it
in some well-defined way, and then iterating over that for the
printing; see e.g. b0702ac5588333e27d7ec43d21d704521f7a05c6.

> +    auto region = region_refcnt.first;
> +    auto actual_refcnt = region_refcnt.second;
> +    const svalue *ob_refcnt_sval
> +       = retrieve_ob_refcnt_sval (region, model, ctxt);
> +    const svalue *actual_refcnt_sval = mgr->get_or_create_int_cst (
> +       ob_refcnt_sval->get_type (), actual_refcnt);
> +
> +    region->dump_to_pp (&pp, true);
> +    pp_string (&pp, " — ob_refcnt: ");
> +    ob_refcnt_sval->dump_to_pp (&pp, true);
> +    pp_string (&pp, " actual refcnt: ");
> +    actual_refcnt_sval->dump_to_pp (&pp, true);
> +    pp_newline (&pp);
> +  }
> +  pp_string (&pp, "~~~~~~~~\n");
> +  pp_flush (&pp);
> +}
> +
> 
[...snip...]

Does the testcase pass or fail with this code?  Do you get a new
warning for it?

How about some trivial new testcases e.g.:

void test_dropping_PyLong (long val)
{
   PyObject* p = PyLong_FromLong(val);
   /* Do nothing with 'p'; should be reported as a leak.  */
}

PyObject *test_valid_PyLong (long val)
{
   PyObject* p = PyLong_FromLong(val);
   return p; // shouldn't report here
}

PyObject *test_stray_incref_PyLong (long val)
{
   PyObject* p = PyLong_FromLong(val);
   Py_INCREF (p);  // bogus
   return p; // should report that ob_refcnt is 2 when it should be 1
}

...or somesuch.

The patch is almost ready for trunk as-is, but:
- needs a ChangeLog
- please fix the dual "warn" vfuncs thing I mentioned above
- check it compiles!
- make sure the testsuite doesn't emit any new FAILs (e.g. by adding a
dg-warning with a suitable short regex, e.g.
  /* { dg-warning "reference count" ) */

Hope this is constructive
Dave

next prev parent reply	other threads:[~2023-08-25 19:50 UTC|newest]

Thread overview: 50+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-07-25  4:49 Update and Questions " Eric Feng
2023-07-25 14:41 ` David Malcolm
2023-07-27 22:13   ` Eric Feng
2023-07-27 22:35     ` David Malcolm
2023-07-30 17:52       ` Eric Feng
2023-07-30 23:44         ` David Malcolm
2023-08-01 13:57           ` Eric Feng
2023-08-01 17:06             ` David Malcolm
2023-08-04 15:02               ` Eric Feng
2023-08-04 15:39                 ` David Malcolm
2023-08-04 20:48                   ` Eric Feng
2023-08-04 22:42                     ` David Malcolm
2023-08-04 22:46                       ` David Malcolm
2023-08-07 18:31                         ` Eric Feng
2023-08-07 23:16                           ` David Malcolm
2023-08-08 16:51                             ` [PATCH] WIP for dg-require-python-h [PR107646] Eric Feng
2023-08-08 18:08                               ` David Malcolm
2023-08-08 18:51                               ` David Malcolm
2023-08-09 19:22                                 ` [PATCH v2] analyzer: More features for CPython analyzer plugin [PR107646] Eric Feng
2023-08-09 21:36                                   ` David Malcolm
2023-08-11 17:47                                     ` [COMMITTED] " Eric Feng
2023-08-11 20:23                                       ` Eric Feng
2023-08-16 19:17                                         ` Update on CPython Extension Module -fanalyzer plugin development Eric Feng
2023-08-16 21:28                                           ` David Malcolm
2023-08-17  1:47                                             ` Eric Feng
2023-08-21 14:05                                               ` Eric Feng
2023-08-21 15:04                                                 ` David Malcolm
2023-08-23 21:15                                                   ` Eric Feng
2023-08-23 23:16                                                     ` David Malcolm
2023-08-24 14:45                                                       ` Eric Feng
2023-08-25 12:50                                                         ` Eric Feng
2023-08-25 19:50                                                           ` David Malcolm [this message]
2023-08-29  4:31                                                             ` [PATCH] analyzer: implement reference count checking for CPython plugin [PR107646] Eric Feng
2023-08-29  4:35                                                               ` Eric Feng
2023-08-29 17:28                                                                 ` Eric Feng
2023-08-29 21:14                                                                   ` David Malcolm
2023-08-30 22:15                                                                     ` Eric Feng
2023-08-31 17:01                                                                       ` David Malcolm
2023-08-31 19:09                                                                         ` Eric Feng
2023-08-31 20:19                                                                           ` David Malcolm
2023-09-01  1:25                                                                             ` Eric Feng
2023-09-01 11:57                                                                               ` David Malcolm
2023-09-05  2:13                                                                                 ` [PATCH] analyzer: implement symbolic value support for CPython plugin's refcnt checker [PR107646] Eric Feng
2023-09-07 17:28                                                                                   ` David Malcolm
2023-09-11  2:12                                                                                     ` Eric Feng
2023-09-11 19:00                                                                                       ` David Malcolm
2023-08-29 21:08                                                               ` [PATCH] analyzer: implement reference count checking for CPython plugin [PR107646] David Malcolm
2023-09-01  2:49                                                               ` Hans-Peter Nilsson
2023-09-01 14:51                                                                 ` David Malcolm
2023-09-01 21:07                                                                   ` Eric Feng

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=c720fc39d86cb3ae8041fe5c86dad34d71466cec.camel@redhat.com \
    --to=dmalcolm@redhat.com \
    --cc=ef2648@columbia.edu \
    --cc=gcc@gcc.gnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).