public inbox for gcc@gcc.gnu.org
 help / color / mirror / Atom feed
From: David Malcolm <dmalcolm@redhat.com>
To: Mir Immad <mirimnan017@gmail.com>, gcc@gcc.gnu.org
Subject: Re: [PATCH] analyzer PR 106003
Date: Thu, 23 Jun 2022 19:20:51 -0400	[thread overview]
Message-ID: <1da7fe0090c220a7c049b3527f2735ddc60b5659.camel@redhat.com> (raw)
In-Reply-To: <CAE1-7ox+fZV3oKNuq5sCMvLK4Y1Hyo7pOwmV8kTckbrEYPYEcQ@mail.gmail.com>

On Fri, 2022-06-24 at 00:00 +0530, Mir Immad wrote:

Thanks for the updated patch.

This is close to being ready.

One big issue is the target hook idea for isolating the target's
definition of the O_* flags as per Joseph's suggestion here:
  https://gcc.gnu.org/pipermail/gcc/2022-June/238961.html
Given Joseph's comment here:
  https://gcc.gnu.org/pipermail/gcc/2022-June/238954.html
I'm inclined to think that the target hook stuff could be done as a
followup.

Indentation has gone a bit wrong (presumably due to issue with the
mailer); the main thing to watch out for is to use tabs for 8 space
indentations within the file.  Enabling "visual whitespace" (or
whatever it's called) in your editor can help with this.

Every patch that adds any options (e.g. to gcc/analyzer/analyzer.opt)
needs to document those options, in gcc/doc/invoke.texi.  Grep for one
of the existing analyzer options in invoke.texi to see what you need to
add for the new options.

You'll need a ChangeLog entry before this can be pushed.  For better or
worse, the project requires ChangeLog entries.  You can
use the script ./contrib/mklog.py to help generate this.  Some more
information is here:
  https://www.gnu.org/prep/standards/html_node/Change-Logs.html
In GCC our ChangeLog entries are part of the git commit message, and we
have a serverside script to update the ChangeLog files in the source
tree (to avoid constantly having merge conflicts).

The Subject line for the commit should have an "analyzer: " prefix
and reference the bugzilla problem report (PR 106003)

Before you can push the patch, you'll have to test it with a full
bootstrap and regression test run.  Let me know if you need help with
that.

Various other comments inline below...

[...snip...]

> diff --git gcc/analyzer/analyzer.opt gcc/analyzer/analyzer.opt
> index 23dfc797cea..e2a629bb42c 100644
> --- gcc/analyzer/analyzer.opt
> +++ gcc/analyzer/analyzer.opt

[...snip...]
> +Wanalyzer-fd-use-without-check
> +Common Var(warn_analyzer_fd_use_without_check) Init(1) Warning
> +warn about code paths in which a possibly invalid file descriptor is
> passed to a must-be-a-valid file descriptor function argument.

These should be capitalized as English sentences, and the "must-be-a-
valid" wording doesn't feel quite right to me.

Reword this to something like:

"Warn about code paths in which a file descriptor is used without being
checked for validity."

or somesuch (all on one line).

[...snip...]

> diff --git gcc/analyzer/sm-fd.cc gcc/analyzer/sm-fd.cc
> new file mode 100644
> index 00000000000..048014d7a26
> --- /dev/null
> +++ gcc/analyzer/sm-fd.cc
> @@ -0,0 +1,770 @@
> +/* A state machine for detecting misuses of POSIX file descriptor APIs.
> +   Copyright (C) 2019-2022 Free Software Foundation, Inc.
> +   Contributed by Immad Mir <mirimmad17@gmail.com>.
> +
> +This file is part of GCC.
> +
> +GCC is free software; you can redistribute it and/or modify it
> +under the terms of the GNU General Public License as published by
> +the Free Software Foundation; either version 3, or (at your option)
> +any later version.
> +
> +GCC is distributed in the hope that it will be useful, but
> +WITHOUT ANY WARRANTY; without even the implied warranty of
> +MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +General Public License for more details.
> +
> +You should have received a copy of the GNU General Public License
> +along with GCC; see the file COPYING3.  If not see
> +<http://www.gnu.org/licenses/>.  */
> +
> +#include "config.h"
> +#include "system.h"
> +#include "coretypes.h"
> +#include "tree.h"
> +#include "function.h"
> +#include "basic-block.h"
> +#include "gimple.h"
> +#include "options.h"
> +#include "diagnostic-path.h"
> +#include "diagnostic-metadata.h"
> +#include "function.h"
> +#include "json.h"
> +#include "analyzer/analyzer.h"
> +#include "diagnostic-event-id.h"
> +#include "analyzer/analyzer-logging.h"
> +#include "analyzer/sm.h"
> +#include "analyzer/pending-diagnostic.h"
> +#include "analyzer/function-set.h"
> +#include "analyzer/analyzer-selftests.h"
> +#include "tristate.h"
> +#include "selftest.h"
> +#include "analyzer/call-string.h"
> +#include "analyzer/program-point.h"
> +#include "analyzer/store.h"
> +#include "analyzer/region-model.h"
> +
> +#include <unistd.h>

We'll want to drop the #include <unistd.h> once we have a target hook.

[...snip...]
> +
> +  state_machine::state_t
> +  get_default_state (const svalue *sval) const final override
> +  {
> +    if (tree cst = sval->maybe_get_constant ())
> +      {
> +        int val = TREE_INT_CST_LOW (cst);
> +        if (val < 0)
> +          {
> +            return m_invalid;
> +          }
> +      }
> +    return m_start;
> +  }

Please add coverage to the testsuite for the case of accessing negative
and non-negative constant integer values: what happens when we hit the
above cases?

In particular, IIRC 0, 1, and 2 are stdin, stdout, and stderr
respectively, and are already open at the start of "main" (though not
necessarily at the start of any given *analysis path*, so we can't
assume that) - so there may be code out there that has hardcoded
accesses using these constant integer values.

[...snip...]

> +
> +  /* State for reperesenting a file descriptor that is known to be valid (>= 0),
> +    for three different access */

Add " modes." to the end of the comment.

[...snip...]

> +/* Base diagnostic class relative to fd_state_machine. */
> +class fd_diagnostic : public pending_diagnostic
> +{
> +public:
> +  fd_diagnostic (const fd_state_machine &sm, tree arg) : m_sm (sm), m_arg
> (arg)
> +  {
> +  }
> +
> +  bool
> +  subclass_equal_p (const pending_diagnostic &base_other) const override
> +  {
> +    return same_tree_p (m_arg, ((const fd_diagnostic &)base_other).m_arg);
> +  }
> +
> +  label_text
> +  describe_state_change (const evdesc::state_change &change) override
> +  {
> +    if (change.m_old_state == m_sm.get_start_state ()
> +        && m_sm.is_unchecked (change.m_new_state))
> +      {
> +        if (change.m_new_state == m_sm.m_unchecked_read_write)
> +          return change.formatted_print ("opened here as %<read-write%>");
> +
> +        if (change.m_new_state == m_sm.m_unchecked_read_only)
> +          return change.formatted_print ("opened here as %<read-only%>");
> +
> +        if (change.m_new_state == m_sm.m_unchecked_write_only)
> +          return change.formatted_print ("opened here as %<write-only%>");

"read-write" etc in these messages are part of the text of the
messages, rather than quoting a source-language construct, so drop the
%< and %> here.

> +      }
> +
> +    if (change.m_new_state == m_sm.m_closed)
> +      return change.formatted_print ("closed here");
> +
> +    if (m_sm.is_unchecked (change.m_old_state)
> +        && m_sm.is_valid (change.m_new_state))
> +      if (change.m_expr)
> +        return change.formatted_print (
> +            "assuming %qE is a valid file descriptor (>= 0)",
> change.m_expr);
> +      else
> +        return change.formatted_print ("assuming a valid file descriptor");
> +
> +    if (m_sm.is_unchecked (change.m_old_state)
> +        && change.m_new_state == m_sm.m_invalid)
> +      if (change.m_expr)
> +        return change.formatted_print (
> +            "assuming %qE is an invalid file descriptor (< 0)",
> change.m_expr);
> +      else
> +        return change.formatted_print ("assuming an invalid file
> descriptor");
> +
> +    return label_text ();
> +  }
> +
> +protected:
> +  const fd_state_machine &m_sm;
> +  tree m_arg;
> +};
> +
> +class fd_access_mode_mismatch : public fd_diagnostic
> +{
> +public:
> +  /* mode specfies whether read on write was attempted or vice versa.
> +   */
> +  fd_access_mode_mismatch (const fd_state_machine &sm, tree arg,
> +                           enum access_mode mode, const tree callee_fndecl)
> +      : m_mode (mode), m_callee_fndecl (callee_fndecl), fd_diagnostic (sm,
> arg)
> +  {
> +  }

Every subclass of fd_diagnostic that adds fields ought to have its own
override of:
  subclass_equal_p (const pending_diagnostic &base_other) const override
which chains up to the parent class implementation, and also compares
the new fields.

> +
> +  const char *
> +  get_kind () const final override
> +  {
> +    return "fd_access_mode_mismatch";
> +  }
> +
> +  int
> +  get_controlling_option () const final override
> +  {
> +    return OPT_Wanalyzer_fd_access_mode_mismatch;
> +  }
> +
> +  bool
> +  emit (rich_location *rich_loc) final override
> +  {
> +    return warning_at (rich_loc, get_controlling_option (),
> +                       "%qE on %qs file descriptor %qE", m_callee_fndecl,
> +                       m_sm.get_string_for_access_mode (m_mode), m_arg);

It's bad for localization to be constructing strings from fragments
like this.

It's a pain, but please eliminate get_string_for_access_mode in favor
of switch statements wherever you'd use it, like here.

> +  }
> +
> +  label_text
> +  describe_state_change (const evdesc::state_change &change) override
> +  {
> +    return fd_diagnostic::describe_state_change (change);
> +  }
> +
> +  label_text
> +  describe_final_event (const evdesc::final_event &ev) final override
> +  {
> +    return ev.formatted_print ("%qE on %qs file descriptor %qE",
> +                               m_callee_fndecl,
> +                               m_sm.get_string_for_access_mode (m_mode), m_arg);

As above, please eliminate get_string_for_access_mode and convert to a
switch.

> +  }
> +
> +private:
> +  enum access_mode m_mode;
> +  const tree m_callee_fndecl;
> +};
> +

[...snip...]

> +
> +class fd_use_after_close : public fd_diagnostic
> +{
> +public:
> +  fd_use_after_close (const fd_state_machine &sm, tree arg,
> +                      const tree callee_fndecl)
> +      : m_callee_fndecl (callee_fndecl), fd_diagnostic (sm, arg)
> +  {
> +  }

This subclass has its own fields, and so needs to implement:
  subclass_equal_p (const pending_diagnostic &base_other) const override


[...snip...]

> +class unchecked_use_of_fd : public fd_diagnostic
> +{
> +public:
> +  unchecked_use_of_fd (const fd_state_machine &sm, tree arg,
> +                       const tree callee_fndecl)
> +      : m_callee_fndecl (callee_fndecl), fd_diagnostic (sm, arg)
> +  {
> +  }
> +
> +  const char *
> +  get_kind () const final override
> +  {
> +    return "unchecked_use_of_fd";
> +  }
> +
> +  int
> +  get_controlling_option () const final override
> +  {
> +    return OPT_Wanalyzer_fd_use_without_check;
> +  }
> +
> +  bool
> +  emit (rich_location *rich_loc) final override
> +  {
> +
> +    return warning_at (rich_loc, get_controlling_option (),
> +                       "%qE on possibly invalid file descriptor %qE",
> +                       m_callee_fndecl, m_arg);

How about:
  "%qE on file descriptor %qE without checking %<open%> succeeded"


> +  }
> +
> +  label_text
> +  describe_state_change (const evdesc::state_change &change) override
> +  {
> +    if (m_sm.is_unchecked (change.m_new_state))
> +      {
> +        m_first_open_event = change.m_event_id;
> +        return label_text::borrow ("opened here");
> +      }
> +
> +    return fd_diagnostic::describe_state_change (change);
> +  }
> +
> +  label_text
> +  describe_final_event (const evdesc::final_event &ev) final override
> +  {
> +    return ev.formatted_print ("%qE could be invalid: unchecked value from
> %@",
> +                               m_arg, &m_first_open_event);
> +  }
> +
> +private:
> +  diagnostic_event_id_t m_first_open_event;
> +  const tree m_callee_fndecl;
> +};
> +

[...snip...]

> +bool
> +fd_state_machine::is_unchecked (state_t s) const
> +{
> +  return    s == m_unchecked_read_write
> +         || s == m_unchecked_read_only
> +         || s == m_unchecked_write_only;
> +}

Thanks for splitting these onto multiple lines, but please wrap them
with parentheses, so they look like:

  return (s == m_unchecked_read_write
          || s == m_unchecked_read_only
          || s == m_unchecked_write_only);

> +
> +bool
> +fd_state_machine::is_valid (state_t s) const
> +{
> +  return    s == m_valid_read_write
> +         || s == m_valid_read_only
> +         || s == m_valid_write_only;
> +}

Likewise here.

> +
> +const char *
> +fd_state_machine::get_string_for_access_mode (enum access_mode mode) const
> +{
> +  if (mode == READ_ONLY)
> +    return "read-only";
> +  else if (mode == WRITE_ONLY)
> +    return "write-only";
> +  else
> +    return "read-write";
> +}

As noted above it's bad for localization to be constructing strings
from fragments like this, so please eliminate the above function, in
favor of switch statements wherever you'd use it.


> +
> +enum access_mode
> +fd_state_machine::get_access_mode_from_flag (int flag) const
> +{
> +  /* FIXME: this code assumes the access modes on the host and
> +          target are the same, which in practice might not be the case. */

Thanks for moving this into a subroutine.

Joseph says we should introduce a target hook for this:
  https://gcc.gnu.org/pipermail/gcc/2022-June/238961.html

You can see an example of adding a target hook in commit
8cdcea51c0fd753e6a652c9b236e91b3a6e0911c.

As noted above, I think we can leave adding the target hook until a
followup patch, if this is only going to be an issue with cross-
compilation between Hurd and non-Hurd systems.


> +  if (flag == O_RDONLY)
> +    {
> +      return READ_ONLY;
> +    }
> +  else if (flag == O_WRONLY)
> +    {
> +      return WRITE_ONLY;
> +    }
> +  return READ_WRITE;
> +}
> +
> +enum access_mode
> +fd_state_machine::get_access_mode_from_state (state_t state) const
> +{
> +  if (state == m_unchecked_read_only || state == m_valid_read_only)
> +    return READ_ONLY;
> +  else if (state == m_unchecked_write_only || state == m_valid_write_only)
> +    return WRITE_ONLY;
> +  else if (state == m_unchecked_read_write || state == m_valid_write_only)

I think there's a copy-and-paste error in the above line: ^^^^^^^^^^^^^^^^^^

> +    return READ_WRITE;
> +}

[...snip...]

> +void
> +fd_state_machine::on_open (sm_context *sm_ctxt, const supernode *node,
> +                           const gimple *stmt, const gcall *call) const
> +{
> +  tree lhs = gimple_call_lhs (call);
> +  if (lhs)
> +    {
> +      tree arg = gimple_call_arg (call, 1);
> +      if (TREE_CODE (arg) == INTEGER_CST)
> +        {
> +          int flag = TREE_INT_CST_LOW (arg);
> +          enum access_mode mode = get_access_mode_from_flag (flag);
> +          if (mode == READ_ONLY)
> +            sm_ctxt->on_transition (node, stmt, lhs, m_start,
> +                                    m_unchecked_read_only);
> +          else if (mode == WRITE_ONLY)
> +            sm_ctxt->on_transition (node, stmt, lhs, m_start,
> +                                    m_unchecked_write_only);
> +          else
> +            sm_ctxt->on_transition (node, stmt, lhs, m_start,
> +                                    m_unchecked_read_write);

I think this can be a switch (mode).

> +        }
> +    }
> +  else
> +    {
> +      /* FIXME: add leak reporting */
> +    }
> +}

[...snip...]

> +void
> +fd_state_machine::on_read (sm_context *sm_ctxt, const supernode *node,
> +                           const gimple *stmt, const gcall *call,
> +                           const tree callee_fndecl) const
> +{
> +  tree arg = gimple_call_arg (call, 0);
> +  tree diag_arg = sm_ctxt->get_diagnostic_tree (arg);
> +  state_t state = sm_ctxt->get_state (stmt, arg);
> +  enum access_mode mode = get_access_mode_from_state (state);
> +
> +  if (check_for_closed_fd (state))
> +    {
> +      sm_ctxt->warn (node, stmt, arg,
> +                     new fd_use_after_close (*this, diag_arg, callee_fndecl));
> +    }
> +
> +  if (!check_for_closed_fd (state))
> +    {

Why not just an "else" here?

> +
> +      if (!is_valid (sm_ctxt->get_state (stmt, arg)))
> +        {
> +          sm_ctxt->warn (
> +              node, stmt, arg,
> +              new unchecked_use_of_fd (*this, diag_arg, callee_fndecl));
> +        }
> +
> +      if (mode == WRITE_ONLY)
> +        {
> +          sm_ctxt->warn (node, stmt, arg,
> +                         new fd_access_mode_mismatch (*this, diag_arg, mode,
> +                                                      callee_fndecl));
> +        }
> +    }
> +}
> +void
> +fd_state_machine::on_write (sm_context *sm_ctxt, const supernode *node,
> +                            const gimple *stmt, const gcall *call,
> +                            const tree callee_fndecl) const
> +{
> +  tree arg = gimple_call_arg (call, 0);
> +  tree diag_arg = sm_ctxt->get_diagnostic_tree (arg);
> +  state_t state = sm_ctxt->get_state (stmt, arg);
> +  enum access_mode mode = get_access_mode_from_state (state);
> +
> +  if (check_for_closed_fd (state))
> +    {
> +      sm_ctxt->warn (node, stmt, arg,
> +                     new fd_use_after_close (*this, diag_arg,
> callee_fndecl));
> +    }
> +
> +  if (!check_for_closed_fd (state))
> +    {

Again, I think this can be another "else" here.

> +
> +      if (!is_valid (sm_ctxt->get_state (stmt, arg)))
> +        {
> +          sm_ctxt->warn (
> +              node, stmt, arg,
> +              new unchecked_use_of_fd (*this, diag_arg, callee_fndecl));
> +        }

...but I think it's slightly cleaner to move all of this repeated logic
into a "check_for_open_fd" subroutine, that checks both for
fd_use_after_close and for unchecked_use_of_fd.

> +
> +      if (mode == READ_ONLY)
> +        {
> +          tree diag_arg = sm_ctxt->get_diagnostic_tree (arg);
> +          sm_ctxt->warn (node, stmt, arg,
> +                         new fd_access_mode_mismatch (*this, diag_arg,
> mode,
> +                                                      callee_fndecl));
> +        }
> +    }
> +}
> +

[...snip...]

> diff --git gcc/testsuite/gcc.dg/analyzer/fd-1.c
> gcc/testsuite/gcc.dg/analyzer/fd-1.c
> new file mode 100644
> index 00000000000..985e9ac75de
> --- /dev/null
> +++ gcc/testsuite/gcc.dg/analyzer/fd-1.c
> @@ -0,0 +1,26 @@
> +int open(const char *, int mode);
> +#define O_RDONLY 0
> +#define O_WRONLY 1
> +#define O_RDWR 2

If we're going with a target hook, then we should use the target's
<unistd.h> to get at the O_ macros.  But as noted above, that can be
deferred to a follow-up patch.

[...snip...]

Hope this makes sense and is constructive; thanks again for the updated
patch

Dave


  reply	other threads:[~2022-06-23 23:20 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-06-23 18:30 Mir Immad
2022-06-23 23:20 ` David Malcolm [this message]
2022-07-14 15:40   ` David Malcolm
2022-06-25 16:00 Mir Immad
2022-06-26 15:53 ` David Malcolm
2022-06-27 17:07   ` Mir Immad
2022-06-27 17:08 Mir Immad
2022-06-27 20:55 ` David Malcolm

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1da7fe0090c220a7c049b3527f2735ddc60b5659.camel@redhat.com \
    --to=dmalcolm@redhat.com \
    --cc=gcc@gcc.gnu.org \
    --cc=mirimnan017@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).