On Thu, Sep 7, 2023 at 1:28 PM David Malcolm <dmalcolm@redhat.com> wrote:

> On Mon, 2023-09-04 at 22:13 -0400, Eric Feng wrote:
>
> > Hi Dave,
>
> Hi Eric, thanks for the patch.
>
> >
> > Recently I've been working on symbolic value support for the reference
> > count checker. I've attached a patch for it below; let me know it looks
> > OK for trunk. Thanks!
> >
> > Best,
> > Eric
> >
> > ---
> >
> > This patch enhances the reference count checker in the CPython plugin by
> > adding support for symbolic values. Whereas previously we were only able
> > to check the reference count of PyObject* objects created in the scope
> > of the function; we are now able to emit diagnostics on reference count
> > mismatch of objects that were, for example, passed in as a function
> > parameter.
> >
> > rc6.c:6:10: warning: expected ‘obj’ to have reference count: N + ‘1’ but
> ob_refcnt field is N + ‘2’
> >     6 |   return obj;
> >       |          ^~~
>
> [...snip...]
>
> >  create mode 100644
> gcc/testsuite/gcc.dg/plugin/cpython-plugin-test-refcnt.c
> >
> > diff --git a/gcc/testsuite/gcc.dg/plugin/analyzer_cpython_plugin.c
> b/gcc/testsuite/gcc.dg/plugin/analyzer_cpython_plugin.c
> > index bf1982e79c3..d7ecd7fce09 100644
> > --- a/gcc/testsuite/gcc.dg/plugin/analyzer_cpython_plugin.c
> > +++ b/gcc/testsuite/gcc.dg/plugin/analyzer_cpython_plugin.c
> > @@ -314,17 +314,20 @@ public:
> >    {
> >      diagnostic_metadata m;
> >      bool warned;
> > -    // just assuming constants for now
> > -    auto actual_refcnt
> > -     = m_actual_refcnt->dyn_cast_constant_svalue ()->get_constant ();
> > -    auto ob_refcnt = m_ob_refcnt->dyn_cast_constant_svalue
> ()->get_constant ();
> > -    warned = warning_meta (rich_loc, m, get_controlling_option (),
> > -                        "expected %qE to have "
> > -                        "reference count: %qE but ob_refcnt field is:
> %qE",
> > -                        m_reg_tree, actual_refcnt, ob_refcnt);
> > -
> > -    // location_t loc = rich_loc->get_loc ();
> > -    // foo (loc);
> > +
> > +    const auto *actual_refcnt_constant
> > +     = m_actual_refcnt->dyn_cast_constant_svalue ();
> > +    const auto *ob_refcnt_constant =
> m_ob_refcnt->dyn_cast_constant_svalue ();
> > +    if (!actual_refcnt_constant || !ob_refcnt_constant)
> > +      return false;
> > +
> > +    auto actual_refcnt = actual_refcnt_constant->get_constant ();
> > +    auto ob_refcnt = ob_refcnt_constant->get_constant ();
> > +    warned = warning_meta (
> > +     rich_loc, m, get_controlling_option (),
> > +     "expected %qE to have "
> > +     "reference count: N + %qE but ob_refcnt field is N + %qE",
> > +     m_reg_tree, actual_refcnt, ob_refcnt);
> >      return warned;
>
> I know you're emulating the old behavior I implemented way back in
> cpychecker, but I don't like that behavior :(
>
> Specifically, although the patch improves the behavior for symbolic
> values, it regresses the precision of wording for the concrete values
> case.  If we have e.g. a concrete ob_refcnt of 2, whereas we only have
> 1 pointer, then it's more readable to say:
>
>   warning: expected ‘obj’ to have reference count: ‘1’ but ob_refcnt
> field is ‘2’
>
> than:
>
>   warning: expected ‘obj’ to have reference count: N + ‘1’ but ob_refcnt
> field is N + ‘2’
>
> ...and we shouldn't quote concrete numbers, the message should be:
>
>   warning: expected ‘obj’ to have reference count of 1 but ob_refcnt field
> is 2


> or better:
>
>   warning: ‘*obj’ is pointed to by 1 pointer but 'ob_refcnt' field is 2
>
>
> Can you move the unwrapping of the svalue from the tests below into the
> emit vfunc?  That way the m_actual_refcnt doesn't have to be a
> constant_svalue; you could have logic in the emit vfunc to print
> readable messages based on what kind of svalue it is.
>
> Rather than 'N', it might be better to say 'initial'; how about:
>
>   warning: ‘*obj’ is pointed to by 0 additional pointers but 'ob_refcnt'
> field has increased by 1
>   warning: ‘*obj’ is pointed to by 1 additional pointer but 'ob_refcnt'
> field has increased by 2
>   warning: ‘*obj’ is pointed to by 1 additional pointer but 'ob_refcnt'
> field is unchanged
>   warning: ‘*obj’ is pointed to by 2 additional pointers but 'ob_refcnt'
> field has decreased by 1
>   warning: ‘*obj’ is pointed to by 1 fewer pointers but 'ob_refcnt' field
> is unchanged
>
> and similar?
>

That makes sense to me as well (indeed I was just emulating the old
behavior)! Will experiment and keep you posted on a revised patch with this
in mind.  This is somewhat of a minor detail but can we emit ‘*obj’ as
bolded text in the diagnostic message? Currently, I can emit this
(including the asterisk) like so: '*%E'. But unlike using %qE, it doesn't
bold the body of the single quotations. Is this possible?

>
> Maybe have a flag that tracks whether we're talking about a concrete
> value that's absolute versus a concrete value that's relative to the
> initial value?
>
>
> [...snip...]
>
>
> > @@ -369,6 +368,19 @@ increment_region_refcnt (hash_map<const region *,
> int> &map, const region *key)
> >    refcnt = existed ? refcnt + 1 : 1;
> >  }
> >
> > +static const region *
> > +get_region_from_svalue (const svalue *sval, region_model_manager *mgr)
> > +{
> > +  const auto *region_sval = sval->dyn_cast_region_svalue ();
> > +  if (region_sval)
> > +    return region_sval->get_pointee ();
> > +
> > +  const auto *initial_sval = sval->dyn_cast_initial_svalue ();
> > +  if (initial_sval)
> > +    return mgr->get_symbolic_region (initial_sval);
> > +
> > +  return nullptr;
> > +}
>
> This is dereferencing a pointer, right?
>
> Can the caller use region_model::deref_rvalue instead?
>
>
> [...snip...]
>
> > +static void
> > +unwrap_any_ob_refcnt_sval (const svalue *&ob_refcnt_sval)
> > +{
> > +  if (ob_refcnt_sval->get_kind () != SK_CONSTANT)
> > +    {
> > +      auto unwrap_cast = ob_refcnt_sval->maybe_undo_cast ();
> > +      if (!unwrap_cast)
> > +     unwrap_cast = ob_refcnt_sval;
> > +
> > +      if (unwrap_cast->get_kind () == SK_BINOP)
> > +     ob_refcnt_sval = unwrap_cast->dyn_cast_binop_svalue ()->get_arg1
> ();
>
> This would be better spelled as:
>
>          if (const binop_svalue *binop_sval =
> unwrap_cast->dyn_cast_binop_svalue ())
>             ob_refcnt_sval = binop_sval->get_arg1 ();
>
> [...snip...]
>
> >  /* Compare ob_refcnt field vs the actual reference count of a region */
> >  static void
> >  check_refcnt (const region_model *model,
> >             const region_model *old_model,
> >             region_model_context *ctxt,
> >             const hash_map<const ana::region *,
> > -                          int>::iterator::reference_pair region_refcnt)
> > +                          int>::iterator::reference_pair &region_refcnt)
>
> Could really use a typedef for
>   const hash_map<const ana::region *, int>
> to simplify this code.
>
> >  {
> >    region_model_manager *mgr = model->get_manager ();
> >    const auto &curr_region = region_refcnt.first;
> >    const auto &actual_refcnt = region_refcnt.second;
> > +
> >    const svalue *ob_refcnt_sval
> >        = retrieve_ob_refcnt_sval (curr_region, model, ctxt);
> > +  if (!ob_refcnt_sval)
> > +    return;
> > +
> > +  unwrap_any_ob_refcnt_sval (ob_refcnt_sval);
>
> As noted above, can the diagnostic store the pre-unwrapped
> ob_refcnt_sval?  Might mean you have to do the unwrapping both here,
> and later when displaying the diagnostic.  Or (probably best) track
> both the original and unwrapped ob_refcnt_sval, and store both in the
> pending_diagnostic.
>
> > +
> >    const svalue *actual_refcnt_sval = mgr->get_or_create_int_cst (
> >        ob_refcnt_sval->get_type (), actual_refcnt);
> > -
> >    if (ob_refcnt_sval != actual_refcnt_sval)
> > -    {
> > -      const svalue *curr_reg_sval
> > -       = mgr->get_ptr_svalue (pyobj_ptr_tree, curr_region);
> > -      tree reg_tree = old_model->get_representative_tree
> (curr_reg_sval);
> > -      if (!reg_tree)
> > -     return;
> > -
> > -      const auto &eg = ctxt->get_eg ();
> > -      refcnt_stmt_finder finder (*eg, reg_tree);
> > -      auto pd = make_unique<refcnt_mismatch> (curr_region,
> ob_refcnt_sval,
> > -                                           actual_refcnt_sval,
> reg_tree);
> > -      if (pd && eg)
> > -     ctxt->warn (std::move (pd), &finder);
> > -    }
> > +    handle_refcnt_mismatch (old_model, curr_region, ob_refcnt_sval,
> > +                         actual_refcnt_sval, ctxt);
> >  }
> >
> >  static void
> > @@ -493,8 +520,6 @@ count_all_references (const region_model *model,
> >    for (const auto &cluster : *model->get_store ())
> >      {
> >        auto curr_region = cluster.first;
> > -      if (curr_region->get_kind () != RK_HEAP_ALLOCATED)
> > -     continue;
> >
> >        increment_region_refcnt (region_to_refcnt, curr_region);
> >
> > @@ -505,8 +530,8 @@ count_all_references (const region_model *model,
> >
> >         const svalue *unwrapped_sval
> >             = binding_sval->unwrap_any_unmergeable ();
> > -       // if (unwrapped_sval->get_type () != pyobj_ptr_tree)
> > -       // continue;
> > +       if (unwrapped_sval->get_type () != pyobj_ptr_tree)
> > +       continue;
>
> We'll probably want a smarter test for this, that the type "inherits"
> C-style from PyObject (e.g. PyLongObject).
>
>
> >
> >         const region *pointee = unwrapped_sval->maybe_get_region ();
> >         if (!pointee || pointee->get_kind () != RK_HEAP_ALLOCATED)
> > diff --git
> a/gcc/testsuite/gcc.dg/plugin/cpython-plugin-test-PyList_Append.c
> b/gcc/testsuite/gcc.dg/plugin/cpython-plugin-test-PyList_Append.c
> > index e1efd9efda5..46daf2f8975 100644
> > --- a/gcc/testsuite/gcc.dg/plugin/cpython-plugin-test-PyList_Append.c
> > +++ b/gcc/testsuite/gcc.dg/plugin/cpython-plugin-test-PyList_Append.c
> > @@ -75,4 +75,23 @@ test_PyListAppend_6 ()
> >    PyObject *list = NULL;
> >    PyList_Append(list, item);
> >    return list;
> > -}
> > \ No newline at end of file
> > +}
> > +
> > +PyObject *
> > +test_PyListAppend_7 (PyObject *item)
> > +{
> > +  PyObject *list = PyList_New (0);
> > +  Py_INCREF(item);
> > +  PyList_Append(list, item);
> > +  return list;
> > +  /* { dg-warning "expected 'item' to have reference count" "" { target
> *-*-* } .-1 } */
>
> It would be good if these dg-warning directives regexp contained the
> actual and expected counts; I find I can't easily tell what the
> intended output is meant to be.
>
>
> > +}
> > +
> > +PyObject *
> > +test_PyListAppend_8 (PyObject *item, PyObject *list)
> > +{
> > +  Py_INCREF(item);
> > +  Py_INCREF(item);
> > +  PyList_Append(list, item);
> > +  return list;
> > +}
>
> Should we complain here about item->ob_refcnt being too high?
>
> > diff --git a/gcc/testsuite/gcc.dg/plugin/cpython-plugin-test-refcnt.c
> b/gcc/testsuite/gcc.dg/plugin/cpython-plugin-test-refcnt.c
> > new file mode 100644
> > index 00000000000..a7f39509d6d
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.dg/plugin/cpython-plugin-test-refcnt.c
> > @@ -0,0 +1,18 @@
> > +/* { dg-do compile } */
> > +/* { dg-require-effective-target analyzer } */
> > +/* { dg-options "-fanalyzer" } */
> > +/* { dg-require-python-h "" } */
> > +
> > +
> > +#define PY_SSIZE_T_CLEAN
> > +#include <Python.h>
> > +#include "../analyzer/analyzer-decls.h"
> > +
> > +PyObject *
> > +test_refcnt_1 (PyObject *obj)
> > +{
> > +  Py_INCREF(obj);
> > +  Py_INCREF(obj);
> > +  return obj;
> > +  /* { dg-warning "expected 'obj' to have reference count" "" { target
> *-*-* } .-1 } */
>
> Likewise, it would be better for the dg-warning directive's expressed
> the expected "actual vs expected" values.
>
> [...snip...]
>
> Thanks again for the patch, hope this is constructive
>
> Dave
>
>