Re: [PATCH] gdb/python: make more use of host_string_to_python_string

public inbox for gdb-patches@sourceware.org
 help / color / mirror / Atom feed

From: Andrew Burgess <aburgess@redhat.com>
To: gdb-patches@sourceware.org
Subject: Re: [PATCH] gdb/python: make more use of host_string_to_python_string
Date: Thu, 23 Dec 2021 11:35:38 +0000	[thread overview]
Message-ID: <20211223113538.GJ2634@redhat.com> (raw)
In-Reply-To: <20211206181030.2463266-1-aburgess@redhat.com>

* Andrew Burgess <aburgess@redhat.com> [2021-12-06 18:10:30 +0000]:

> We have a function host_string_to_python_string, which is a wrapper
> around PyString_Decode, which, on Python 3, is an alias for
> PyUnicode_Decode.
> 
> However, there are a few places where we still call PyUnicode_Decode
> directly.
> 
> This commit replaces all uses of PyUnicode_Decode with calls to
> host_string_to_python_string instead.
> 
> To make the use of host_string_to_python_string easier, I've added a
> couple of overloads for this function in python-internal.h, these all
> just forward their calls onto the single base implementation.  The
> signatures of all host_string_to_python_string overloads are:
> 
>   gdbpy_ref<> host_string_to_python_string (const char *str, size_t len);
>   gdbpy_ref<> host_string_to_python_string (const char *str)
>   gdbpy_ref<> host_string_to_python_string (const string_file &str)
> 
> For Python 3 PyString_Decode is setup (in python-internal.h) as an
> alias for PyUnicode_Decode, so there should certainly be no user
> visible changes in this case.
> 
> For Python 2 this commit will change the behaviour.  Previously by
> calling PyUnicode_Decode we would have been returning a Unicode
> object.  Now, after calling host_string_to_python_string, we will have
> a str object.
> 
> I've checked the GDB documentation, and, as far as I can tell, the
> methods I've touched were all documented as returning a string, or in
> the gdb.Command case, take a string as an argument, and my
> understanding is that for Python 2, string generally means str.  So I
> think the new behaviour would be more expected.
> 
> A different solution, that would also make things more consistent in
> the Python 2 world, would be to have host_string_to_python_string
> always return a Unicode object.  However, I've been reading this page:
> 
>   https://pythonhosted.org/kitchen/unicode-frustrations.html
> 
> item #3 recommends that unicode strings should be converted to str
> objects before being printed (in Python 2).  That would mean that
> users should have been adding .encode() calls to the output of the
> routines I've changed in this commit (if they wanted to print that
> output), which is not something I think is made clear from the GDB
> documentation.
> ---
>  gdb/python/py-cmd.c          |  9 +++------
>  gdb/python/py-frame.c        |  5 ++---
>  gdb/python/py-type.c         |  3 +--
>  gdb/python/py-utils.c        |  5 ++---
>  gdb/python/py-value.c        |  4 ++--
>  gdb/python/python-internal.h | 11 ++++++++++-
>  gdb/python/python.c          |  4 ++--
>  7 files changed, 22 insertions(+), 19 deletions(-)
> 
> diff --git a/gdb/python/py-cmd.c b/gdb/python/py-cmd.c
> index 94608e0bbcf..b2cafeba320 100644
> --- a/gdb/python/py-cmd.c
> +++ b/gdb/python/py-cmd.c
> @@ -120,8 +120,7 @@ cmdpy_function (const char *args, int from_tty, cmd_list_element *command)
>  
>    if (! args)
>      args = "";
> -  gdbpy_ref<> argobj (PyUnicode_Decode (args, strlen (args), host_charset (),
> -					NULL));
> +  gdbpy_ref<> argobj = host_string_to_python_string (args);
>    if (argobj == NULL)
>      {
>        gdbpy_print_stack ();
> @@ -181,8 +180,7 @@ cmdpy_completer_helper (struct cmd_list_element *command,
>        return NULL;
>      }
>  
> -  gdbpy_ref<> textobj (PyUnicode_Decode (text, strlen (text), host_charset (),
> -					 NULL));
> +  gdbpy_ref<> textobj = host_string_to_python_string (text);
>    if (textobj == NULL)
>      error (_("Could not convert argument to Python string."));
>  
> @@ -194,8 +192,7 @@ cmdpy_completer_helper (struct cmd_list_element *command,
>      }
>    else
>      {
> -      wordobj.reset (PyUnicode_Decode (word, strlen (word), host_charset (),
> -				       NULL));
> +      wordobj = host_string_to_python_string (word);
>        if (wordobj == NULL)
>  	error (_("Could not convert argument to Python string."));
>      }
> diff --git a/gdb/python/py-frame.c b/gdb/python/py-frame.c
> index ee57eb10576..b507ff0794f 100644
> --- a/gdb/python/py-frame.c
> +++ b/gdb/python/py-frame.c
> @@ -131,8 +131,7 @@ frapy_name (PyObject *self, PyObject *args)
>  
>    if (name)
>      {
> -      result = PyUnicode_Decode (name.get (), strlen (name.get ()),
> -				 host_charset (), NULL);
> +      result = host_string_to_python_string (name.get ()).release ();
>      }
>    else
>      {
> @@ -658,7 +657,7 @@ gdbpy_frame_stop_reason_string (PyObject *self, PyObject *args)
>      }
>  
>    str = unwind_stop_reason_to_string ((enum unwind_stop_reason) reason);
> -  return PyUnicode_Decode (str, strlen (str), host_charset (), NULL);
> +  return host_string_to_python_string (str).release ();
>  }
>  
>  /* Implements the equality comparison for Frame objects.
> diff --git a/gdb/python/py-type.c b/gdb/python/py-type.c
> index 8b17b70fbe3..a178c6a4ab2 100644
> --- a/gdb/python/py-type.c
> +++ b/gdb/python/py-type.c
> @@ -1033,8 +1033,7 @@ typy_str (PyObject *self)
>        GDB_PY_HANDLE_EXCEPTION (except);
>      }
>  
> -  return PyUnicode_Decode (thetype.c_str (), thetype.size (),
> -			   host_charset (), NULL);
> +  return host_string_to_python_string (thetype).release ();
>  }
>  
>  /* Implement the richcompare method.  */
> diff --git a/gdb/python/py-utils.c b/gdb/python/py-utils.c
> index 10c4173efcd..2eb1ed2a09e 100644
> --- a/gdb/python/py-utils.c
> +++ b/gdb/python/py-utils.c
> @@ -152,10 +152,9 @@ python_string_to_host_string (PyObject *obj)
>  /* Convert a host string to a python string.  */
>  
>  gdbpy_ref<>
> -host_string_to_python_string (const char *str)
> +host_string_to_python_string (const char *str, size_t len)
>  {
> -  return gdbpy_ref<> (PyString_Decode (str, strlen (str), host_charset (),
> -				       NULL));
> +  return gdbpy_ref<> (PyString_Decode (str, len, host_charset (), nullptr));
>  }
>  
>  /* Return true if OBJ is a Python string or unicode object, false
> diff --git a/gdb/python/py-value.c b/gdb/python/py-value.c
> index c843c2c3072..8bd30729454 100644
> --- a/gdb/python/py-value.c
> +++ b/gdb/python/py-value.c
> @@ -743,7 +743,7 @@ valpy_format_string (PyObject *self, PyObject *args, PyObject *kw)
>        GDB_PY_HANDLE_EXCEPTION (except);
>      }
>  
> -  return PyUnicode_Decode (stb.c_str (), stb.size (), host_charset (), NULL);
> +  return host_string_to_python_string (stb).release ();
>  }
>  
>  /* A helper function that implements the various cast operators.  */
> @@ -1149,7 +1149,7 @@ valpy_str (PyObject *self)
>        GDB_PY_HANDLE_EXCEPTION (except);
>      }
>  
> -  return PyUnicode_Decode (stb.c_str (), stb.size (), host_charset (), NULL);
> +  return host_string_to_python_string (stb).release ();
>  }
>  
>  /* Implements gdb.Value.is_optimized_out.  */
> diff --git a/gdb/python/python-internal.h b/gdb/python/python-internal.h
> index 211833e4b2d..195cb0a1896 100644
> --- a/gdb/python/python-internal.h
> +++ b/gdb/python/python-internal.h
> @@ -22,6 +22,7 @@
>  
>  #include "extension.h"
>  #include "extension-priv.h"
> +#include "ui-file.h"
>  
>  /* These WITH_* macros are defined by the CPython API checker that
>     comes with the Python plugin for GCC.  See:
> @@ -715,7 +716,15 @@ gdb::unique_xmalloc_ptr<char> unicode_to_target_string (PyObject *unicode_str);
>  gdb::unique_xmalloc_ptr<char> python_string_to_target_string (PyObject *obj);
>  gdbpy_ref<> python_string_to_target_python_string (PyObject *obj);
>  gdb::unique_xmalloc_ptr<char> python_string_to_host_string (PyObject *obj);
> -gdbpy_ref<> host_string_to_python_string (const char *str);
> +gdbpy_ref<> host_string_to_python_string (const char *str, size_t len);
> +static inline gdbpy_ref<> host_string_to_python_string (const char *str)
> +{
> +  return host_string_to_python_string (str, strlen (str));
> +}
> +static inline gdbpy_ref<> host_string_to_python_string (const string_file &str)
> +{
> +  return host_string_to_python_string (str.c_str (), str.size ());
> +}
>  int gdbpy_is_string (PyObject *obj);
>  gdb::unique_xmalloc_ptr<char> gdbpy_obj_to_string (PyObject *obj);
>  
> diff --git a/gdb/python/python.c b/gdb/python/python.c
> index 82af012068b..6e85d30ed97 100644
> --- a/gdb/python/python.c
> +++ b/gdb/python/python.c
> @@ -558,7 +558,7 @@ gdbpy_target_charset (PyObject *self, PyObject *args)
>  {
>    const char *cset = target_charset (python_gdbarch);
>  
> -  return PyUnicode_Decode (cset, strlen (cset), host_charset (), NULL);
> +  return host_string_to_python_string (cset).release ();
>  }

I wonder if we are over using 'host_charset ()'?  Consider this
session:

  gdb) set host-charset UTF-32
  (gdb) python print(gdb.target_charset ())
  Traceback (most recent call last):
    File "<string>", line 1, in <module>
  UnicodeDecodeError: 'utf-32-le' codec can't decode bytes in position 0-3: code point not in range(0x110000)
  Error while executing Python code.
  (gdb) 

Because the string we're encoding is not in host_charset, but is an
ascii string from within GDB.  I wonder if in situations like this we
should actually be saying:

  static PyObject *
  gdbpy_target_charset (PyObject *self, PyObject *args)
  {
    const char *cset = target_charset (python_gdbarch);

    return PyUnicode_Decode (cset, strlen (cset), "utf-8", NULL);
  }

Then this function should be documented as returning a utf-8 unicode
object.

Though, having said that, I wonder if the real problem is with 'set
host-charset'?  Consider:

  (gdb) help set host-charset
  Set the host character set.
  The `host character set' is the one used by the system GDB is running on.
  You may only use supersets of ASCII for your host character set; GDB does
  not support any others.
  To see a list of the character sets GDB supports, type `set host-charset <TAB>'.

It's not clear what a superset of ASCII is, but I would have assumed
it meant a character set where the bytes for any ascii character are
unchanged in the unicode encoding, which, as I understand it, is not
the case for things like utf-32.  So maybe superset of ASCII means
something different... or maybe I'm not understanding utf-32
correctly...

Thanks,
Andrew

     prev parent reply	other threads:[~2021-12-23 11:35 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-12-06 18:10 Andrew Burgess
2021-12-22 15:08 ` PING: " Andrew Burgess
2021-12-23  2:19 ` Simon Marchi
2021-12-23 11:07   ` Andrew Burgess
2021-12-24  3:10     ` Simon Marchi
2021-12-24 12:20       ` Andrew Burgess
2021-12-27  1:30         ` Simon Marchi
2021-12-23 11:35 ` Andrew Burgess [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20211223113538.GJ2634@redhat.com \
    --to=aburgess@redhat.com \
    --cc=gdb-patches@sourceware.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).