[PATCH] gdb/python: make more use of host_string_to_python

public inbox for gdb-patches@sourceware.org
 help / color / mirror / Atom feed

* [PATCH] gdb/python: make more use of host_string_to_python_string
@ 2021-12-06 18:10 Andrew Burgess
  2021-12-22 15:08 ` PING: " Andrew Burgess
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Andrew Burgess @ 2021-12-06 18:10 UTC (permalink / raw)
  To: gdb-patches; +Cc: Andrew Burgess

We have a function host_string_to_python_string, which is a wrapper
around PyString_Decode, which, on Python 3, is an alias for
PyUnicode_Decode.

However, there are a few places where we still call PyUnicode_Decode
directly.

This commit replaces all uses of PyUnicode_Decode with calls to
host_string_to_python_string instead.

To make the use of host_string_to_python_string easier, I've added a
couple of overloads for this function in python-internal.h, these all
just forward their calls onto the single base implementation.  The
signatures of all host_string_to_python_string overloads are:

  gdbpy_ref<> host_string_to_python_string (const char *str, size_t len);
  gdbpy_ref<> host_string_to_python_string (const char *str)
  gdbpy_ref<> host_string_to_python_string (const string_file &str)

For Python 3 PyString_Decode is setup (in python-internal.h) as an
alias for PyUnicode_Decode, so there should certainly be no user
visible changes in this case.

For Python 2 this commit will change the behaviour.  Previously by
calling PyUnicode_Decode we would have been returning a Unicode
object.  Now, after calling host_string_to_python_string, we will have
a str object.

I've checked the GDB documentation, and, as far as I can tell, the
methods I've touched were all documented as returning a string, or in
the gdb.Command case, take a string as an argument, and my
understanding is that for Python 2, string generally means str.  So I
think the new behaviour would be more expected.

A different solution, that would also make things more consistent in
the Python 2 world, would be to have host_string_to_python_string
always return a Unicode object.  However, I've been reading this page:

  https://pythonhosted.org/kitchen/unicode-frustrations.html

item #3 recommends that unicode strings should be converted to str
objects before being printed (in Python 2).  That would mean that
users should have been adding .encode() calls to the output of the
routines I've changed in this commit (if they wanted to print that
output), which is not something I think is made clear from the GDB
documentation.
---
 gdb/python/py-cmd.c          |  9 +++------
 gdb/python/py-frame.c        |  5 ++---
 gdb/python/py-type.c         |  3 +--
 gdb/python/py-utils.c        |  5 ++---
 gdb/python/py-value.c        |  4 ++--
 gdb/python/python-internal.h | 11 ++++++++++-
 gdb/python/python.c          |  4 ++--
 7 files changed, 22 insertions(+), 19 deletions(-)

diff --git a/gdb/python/py-cmd.c b/gdb/python/py-cmd.c
index 94608e0bbcf..b2cafeba320 100644
--- a/gdb/python/py-cmd.c
+++ b/gdb/python/py-cmd.c
@@ -120,8 +120,7 @@ cmdpy_function (const char *args, int from_tty, cmd_list_element *command)
 
   if (! args)
     args = "";
-  gdbpy_ref<> argobj (PyUnicode_Decode (args, strlen (args), host_charset (),
-					NULL));
+  gdbpy_ref<> argobj = host_string_to_python_string (args);
   if (argobj == NULL)
     {
       gdbpy_print_stack ();
@@ -181,8 +180,7 @@ cmdpy_completer_helper (struct cmd_list_element *command,
       return NULL;
     }
 
-  gdbpy_ref<> textobj (PyUnicode_Decode (text, strlen (text), host_charset (),
-					 NULL));
+  gdbpy_ref<> textobj = host_string_to_python_string (text);
   if (textobj == NULL)
     error (_("Could not convert argument to Python string."));
 
@@ -194,8 +192,7 @@ cmdpy_completer_helper (struct cmd_list_element *command,
     }
   else
     {
-      wordobj.reset (PyUnicode_Decode (word, strlen (word), host_charset (),
-				       NULL));
+      wordobj = host_string_to_python_string (word);
       if (wordobj == NULL)
 	error (_("Could not convert argument to Python string."));
     }
diff --git a/gdb/python/py-frame.c b/gdb/python/py-frame.c
index ee57eb10576..b507ff0794f 100644
--- a/gdb/python/py-frame.c
+++ b/gdb/python/py-frame.c
@@ -131,8 +131,7 @@ frapy_name (PyObject *self, PyObject *args)
 
   if (name)
     {
-      result = PyUnicode_Decode (name.get (), strlen (name.get ()),
-				 host_charset (), NULL);
+      result = host_string_to_python_string (name.get ()).release ();
     }
   else
     {
@@ -658,7 +657,7 @@ gdbpy_frame_stop_reason_string (PyObject *self, PyObject *args)
     }
 
   str = unwind_stop_reason_to_string ((enum unwind_stop_reason) reason);
-  return PyUnicode_Decode (str, strlen (str), host_charset (), NULL);
+  return host_string_to_python_string (str).release ();
 }
 
 /* Implements the equality comparison for Frame objects.
diff --git a/gdb/python/py-type.c b/gdb/python/py-type.c
index 8b17b70fbe3..a178c6a4ab2 100644
--- a/gdb/python/py-type.c
+++ b/gdb/python/py-type.c
@@ -1033,8 +1033,7 @@ typy_str (PyObject *self)
       GDB_PY_HANDLE_EXCEPTION (except);
     }
 
-  return PyUnicode_Decode (thetype.c_str (), thetype.size (),
-			   host_charset (), NULL);
+  return host_string_to_python_string (thetype).release ();
 }
 
 /* Implement the richcompare method.  */
diff --git a/gdb/python/py-utils.c b/gdb/python/py-utils.c
index 10c4173efcd..2eb1ed2a09e 100644
--- a/gdb/python/py-utils.c
+++ b/gdb/python/py-utils.c
@@ -152,10 +152,9 @@ python_string_to_host_string (PyObject *obj)
 /* Convert a host string to a python string.  */
 
 gdbpy_ref<>
-host_string_to_python_string (const char *str)
+host_string_to_python_string (const char *str, size_t len)
 {
-  return gdbpy_ref<> (PyString_Decode (str, strlen (str), host_charset (),
-				       NULL));
+  return gdbpy_ref<> (PyString_Decode (str, len, host_charset (), nullptr));
 }
 
 /* Return true if OBJ is a Python string or unicode object, false
diff --git a/gdb/python/py-value.c b/gdb/python/py-value.c
index c843c2c3072..8bd30729454 100644
--- a/gdb/python/py-value.c
+++ b/gdb/python/py-value.c
@@ -743,7 +743,7 @@ valpy_format_string (PyObject *self, PyObject *args, PyObject *kw)
       GDB_PY_HANDLE_EXCEPTION (except);
     }
 
-  return PyUnicode_Decode (stb.c_str (), stb.size (), host_charset (), NULL);
+  return host_string_to_python_string (stb).release ();
 }
 
 /* A helper function that implements the various cast operators.  */
@@ -1149,7 +1149,7 @@ valpy_str (PyObject *self)
       GDB_PY_HANDLE_EXCEPTION (except);
     }
 
-  return PyUnicode_Decode (stb.c_str (), stb.size (), host_charset (), NULL);
+  return host_string_to_python_string (stb).release ();
 }
 
 /* Implements gdb.Value.is_optimized_out.  */
diff --git a/gdb/python/python-internal.h b/gdb/python/python-internal.h
index 211833e4b2d..195cb0a1896 100644
--- a/gdb/python/python-internal.h
+++ b/gdb/python/python-internal.h
@@ -22,6 +22,7 @@
 
 #include "extension.h"
 #include "extension-priv.h"
+#include "ui-file.h"
 
 /* These WITH_* macros are defined by the CPython API checker that
    comes with the Python plugin for GCC.  See:
@@ -715,7 +716,15 @@ gdb::unique_xmalloc_ptr<char> unicode_to_target_string (PyObject *unicode_str);
 gdb::unique_xmalloc_ptr<char> python_string_to_target_string (PyObject *obj);
 gdbpy_ref<> python_string_to_target_python_string (PyObject *obj);
 gdb::unique_xmalloc_ptr<char> python_string_to_host_string (PyObject *obj);
-gdbpy_ref<> host_string_to_python_string (const char *str);
+gdbpy_ref<> host_string_to_python_string (const char *str, size_t len);
+static inline gdbpy_ref<> host_string_to_python_string (const char *str)
+{
+  return host_string_to_python_string (str, strlen (str));
+}
+static inline gdbpy_ref<> host_string_to_python_string (const string_file &str)
+{
+  return host_string_to_python_string (str.c_str (), str.size ());
+}
 int gdbpy_is_string (PyObject *obj);
 gdb::unique_xmalloc_ptr<char> gdbpy_obj_to_string (PyObject *obj);
 
diff --git a/gdb/python/python.c b/gdb/python/python.c
index 82af012068b..6e85d30ed97 100644
--- a/gdb/python/python.c
+++ b/gdb/python/python.c
@@ -558,7 +558,7 @@ gdbpy_target_charset (PyObject *self, PyObject *args)
 {
   const char *cset = target_charset (python_gdbarch);
 
-  return PyUnicode_Decode (cset, strlen (cset), host_charset (), NULL);
+  return host_string_to_python_string (cset).release ();
 }
 
 /* Wrapper for target_wide_charset.  */
@@ -568,7 +568,7 @@ gdbpy_target_wide_charset (PyObject *self, PyObject *args)
 {
   const char *cset = target_wide_charset (python_gdbarch);
 
-  return PyUnicode_Decode (cset, strlen (cset), host_charset (), NULL);
+  return host_string_to_python_string (cset).release ();
 }
 
 /* A Python function which evaluates a string using the gdb CLI.  */
-- 
2.25.4


^ permalink raw reply	[flat|nested] 8+ messages in thread

* PING: [PATCH] gdb/python: make more use of host_string_to_python_string
  2021-12-06 18:10 [PATCH] gdb/python: make more use of host_string_to_python_string Andrew Burgess
@ 2021-12-22 15:08 ` Andrew Burgess
  2021-12-23  2:19 ` Simon Marchi
  2021-12-23 11:35 ` Andrew Burgess
  2 siblings, 0 replies; 8+ messages in thread
From: Andrew Burgess @ 2021-12-22 15:08 UTC (permalink / raw)
  To: gdb-patches

Ping!

Just in case anyone's still around :)

Thanks,
Andrew

* Andrew Burgess <aburgess@redhat.com> [2021-12-06 18:10:30 +0000]:

> We have a function host_string_to_python_string, which is a wrapper
> around PyString_Decode, which, on Python 3, is an alias for
> PyUnicode_Decode.
> 
> However, there are a few places where we still call PyUnicode_Decode
> directly.
> 
> This commit replaces all uses of PyUnicode_Decode with calls to
> host_string_to_python_string instead.
> 
> To make the use of host_string_to_python_string easier, I've added a
> couple of overloads for this function in python-internal.h, these all
> just forward their calls onto the single base implementation.  The
> signatures of all host_string_to_python_string overloads are:
> 
>   gdbpy_ref<> host_string_to_python_string (const char *str, size_t len);
>   gdbpy_ref<> host_string_to_python_string (const char *str)
>   gdbpy_ref<> host_string_to_python_string (const string_file &str)
> 
> For Python 3 PyString_Decode is setup (in python-internal.h) as an
> alias for PyUnicode_Decode, so there should certainly be no user
> visible changes in this case.
> 
> For Python 2 this commit will change the behaviour.  Previously by
> calling PyUnicode_Decode we would have been returning a Unicode
> object.  Now, after calling host_string_to_python_string, we will have
> a str object.
> 
> I've checked the GDB documentation, and, as far as I can tell, the
> methods I've touched were all documented as returning a string, or in
> the gdb.Command case, take a string as an argument, and my
> understanding is that for Python 2, string generally means str.  So I
> think the new behaviour would be more expected.
> 
> A different solution, that would also make things more consistent in
> the Python 2 world, would be to have host_string_to_python_string
> always return a Unicode object.  However, I've been reading this page:
> 
>   https://pythonhosted.org/kitchen/unicode-frustrations.html
> 
> item #3 recommends that unicode strings should be converted to str
> objects before being printed (in Python 2).  That would mean that
> users should have been adding .encode() calls to the output of the
> routines I've changed in this commit (if they wanted to print that
> output), which is not something I think is made clear from the GDB
> documentation.
> ---
>  gdb/python/py-cmd.c          |  9 +++------
>  gdb/python/py-frame.c        |  5 ++---
>  gdb/python/py-type.c         |  3 +--
>  gdb/python/py-utils.c        |  5 ++---
>  gdb/python/py-value.c        |  4 ++--
>  gdb/python/python-internal.h | 11 ++++++++++-
>  gdb/python/python.c          |  4 ++--
>  7 files changed, 22 insertions(+), 19 deletions(-)
> 
> diff --git a/gdb/python/py-cmd.c b/gdb/python/py-cmd.c
> index 94608e0bbcf..b2cafeba320 100644
> --- a/gdb/python/py-cmd.c
> +++ b/gdb/python/py-cmd.c
> @@ -120,8 +120,7 @@ cmdpy_function (const char *args, int from_tty, cmd_list_element *command)
>  
>    if (! args)
>      args = "";
> -  gdbpy_ref<> argobj (PyUnicode_Decode (args, strlen (args), host_charset (),
> -					NULL));
> +  gdbpy_ref<> argobj = host_string_to_python_string (args);
>    if (argobj == NULL)
>      {
>        gdbpy_print_stack ();
> @@ -181,8 +180,7 @@ cmdpy_completer_helper (struct cmd_list_element *command,
>        return NULL;
>      }
>  
> -  gdbpy_ref<> textobj (PyUnicode_Decode (text, strlen (text), host_charset (),
> -					 NULL));
> +  gdbpy_ref<> textobj = host_string_to_python_string (text);
>    if (textobj == NULL)
>      error (_("Could not convert argument to Python string."));
>  
> @@ -194,8 +192,7 @@ cmdpy_completer_helper (struct cmd_list_element *command,
>      }
>    else
>      {
> -      wordobj.reset (PyUnicode_Decode (word, strlen (word), host_charset (),
> -				       NULL));
> +      wordobj = host_string_to_python_string (word);
>        if (wordobj == NULL)
>  	error (_("Could not convert argument to Python string."));
>      }
> diff --git a/gdb/python/py-frame.c b/gdb/python/py-frame.c
> index ee57eb10576..b507ff0794f 100644
> --- a/gdb/python/py-frame.c
> +++ b/gdb/python/py-frame.c
> @@ -131,8 +131,7 @@ frapy_name (PyObject *self, PyObject *args)
>  
>    if (name)
>      {
> -      result = PyUnicode_Decode (name.get (), strlen (name.get ()),
> -				 host_charset (), NULL);
> +      result = host_string_to_python_string (name.get ()).release ();
>      }
>    else
>      {
> @@ -658,7 +657,7 @@ gdbpy_frame_stop_reason_string (PyObject *self, PyObject *args)
>      }
>  
>    str = unwind_stop_reason_to_string ((enum unwind_stop_reason) reason);
> -  return PyUnicode_Decode (str, strlen (str), host_charset (), NULL);
> +  return host_string_to_python_string (str).release ();
>  }
>  
>  /* Implements the equality comparison for Frame objects.
> diff --git a/gdb/python/py-type.c b/gdb/python/py-type.c
> index 8b17b70fbe3..a178c6a4ab2 100644
> --- a/gdb/python/py-type.c
> +++ b/gdb/python/py-type.c
> @@ -1033,8 +1033,7 @@ typy_str (PyObject *self)
>        GDB_PY_HANDLE_EXCEPTION (except);
>      }
>  
> -  return PyUnicode_Decode (thetype.c_str (), thetype.size (),
> -			   host_charset (), NULL);
> +  return host_string_to_python_string (thetype).release ();
>  }
>  
>  /* Implement the richcompare method.  */
> diff --git a/gdb/python/py-utils.c b/gdb/python/py-utils.c
> index 10c4173efcd..2eb1ed2a09e 100644
> --- a/gdb/python/py-utils.c
> +++ b/gdb/python/py-utils.c
> @@ -152,10 +152,9 @@ python_string_to_host_string (PyObject *obj)
>  /* Convert a host string to a python string.  */
>  
>  gdbpy_ref<>
> -host_string_to_python_string (const char *str)
> +host_string_to_python_string (const char *str, size_t len)
>  {
> -  return gdbpy_ref<> (PyString_Decode (str, strlen (str), host_charset (),
> -				       NULL));
> +  return gdbpy_ref<> (PyString_Decode (str, len, host_charset (), nullptr));
>  }
>  
>  /* Return true if OBJ is a Python string or unicode object, false
> diff --git a/gdb/python/py-value.c b/gdb/python/py-value.c
> index c843c2c3072..8bd30729454 100644
> --- a/gdb/python/py-value.c
> +++ b/gdb/python/py-value.c
> @@ -743,7 +743,7 @@ valpy_format_string (PyObject *self, PyObject *args, PyObject *kw)
>        GDB_PY_HANDLE_EXCEPTION (except);
>      }
>  
> -  return PyUnicode_Decode (stb.c_str (), stb.size (), host_charset (), NULL);
> +  return host_string_to_python_string (stb).release ();
>  }
>  
>  /* A helper function that implements the various cast operators.  */
> @@ -1149,7 +1149,7 @@ valpy_str (PyObject *self)
>        GDB_PY_HANDLE_EXCEPTION (except);
>      }
>  
> -  return PyUnicode_Decode (stb.c_str (), stb.size (), host_charset (), NULL);
> +  return host_string_to_python_string (stb).release ();
>  }
>  
>  /* Implements gdb.Value.is_optimized_out.  */
> diff --git a/gdb/python/python-internal.h b/gdb/python/python-internal.h
> index 211833e4b2d..195cb0a1896 100644
> --- a/gdb/python/python-internal.h
> +++ b/gdb/python/python-internal.h
> @@ -22,6 +22,7 @@
>  
>  #include "extension.h"
>  #include "extension-priv.h"
> +#include "ui-file.h"
>  
>  /* These WITH_* macros are defined by the CPython API checker that
>     comes with the Python plugin for GCC.  See:
> @@ -715,7 +716,15 @@ gdb::unique_xmalloc_ptr<char> unicode_to_target_string (PyObject *unicode_str);
>  gdb::unique_xmalloc_ptr<char> python_string_to_target_string (PyObject *obj);
>  gdbpy_ref<> python_string_to_target_python_string (PyObject *obj);
>  gdb::unique_xmalloc_ptr<char> python_string_to_host_string (PyObject *obj);
> -gdbpy_ref<> host_string_to_python_string (const char *str);
> +gdbpy_ref<> host_string_to_python_string (const char *str, size_t len);
> +static inline gdbpy_ref<> host_string_to_python_string (const char *str)
> +{
> +  return host_string_to_python_string (str, strlen (str));
> +}
> +static inline gdbpy_ref<> host_string_to_python_string (const string_file &str)
> +{
> +  return host_string_to_python_string (str.c_str (), str.size ());
> +}
>  int gdbpy_is_string (PyObject *obj);
>  gdb::unique_xmalloc_ptr<char> gdbpy_obj_to_string (PyObject *obj);
>  
> diff --git a/gdb/python/python.c b/gdb/python/python.c
> index 82af012068b..6e85d30ed97 100644
> --- a/gdb/python/python.c
> +++ b/gdb/python/python.c
> @@ -558,7 +558,7 @@ gdbpy_target_charset (PyObject *self, PyObject *args)
>  {
>    const char *cset = target_charset (python_gdbarch);
>  
> -  return PyUnicode_Decode (cset, strlen (cset), host_charset (), NULL);
> +  return host_string_to_python_string (cset).release ();
>  }
>  
>  /* Wrapper for target_wide_charset.  */
> @@ -568,7 +568,7 @@ gdbpy_target_wide_charset (PyObject *self, PyObject *args)
>  {
>    const char *cset = target_wide_charset (python_gdbarch);
>  
> -  return PyUnicode_Decode (cset, strlen (cset), host_charset (), NULL);
> +  return host_string_to_python_string (cset).release ();
>  }
>  
>  /* A Python function which evaluates a string using the gdb CLI.  */
> -- 
> 2.25.4
> 


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] gdb/python: make more use of host_string_to_python_string
  2021-12-06 18:10 [PATCH] gdb/python: make more use of host_string_to_python_string Andrew Burgess
  2021-12-22 15:08 ` PING: " Andrew Burgess
@ 2021-12-23  2:19 ` Simon Marchi
  2021-12-23 11:07   ` Andrew Burgess
  2021-12-23 11:35 ` Andrew Burgess
  2 siblings, 1 reply; 8+ messages in thread
From: Simon Marchi @ 2021-12-23  2:19 UTC (permalink / raw)
  To: Andrew Burgess, gdb-patches



On 2021-12-06 13:10, Andrew Burgess via Gdb-patches wrote:
> We have a function host_string_to_python_string, which is a wrapper
> around PyString_Decode, which, on Python 3, is an alias for
> PyUnicode_Decode.
> 
> However, there are a few places where we still call PyUnicode_Decode
> directly.
> 
> This commit replaces all uses of PyUnicode_Decode with calls to
> host_string_to_python_string instead.
> 
> To make the use of host_string_to_python_string easier, I've added a
> couple of overloads for this function in python-internal.h, these all
> just forward their calls onto the single base implementation.  The
> signatures of all host_string_to_python_string overloads are:
> 
>   gdbpy_ref<> host_string_to_python_string (const char *str, size_t len);
>   gdbpy_ref<> host_string_to_python_string (const char *str)
>   gdbpy_ref<> host_string_to_python_string (const string_file &str)
> 
> For Python 3 PyString_Decode is setup (in python-internal.h) as an
> alias for PyUnicode_Decode, so there should certainly be no user
> visible changes in this case.
> 
> For Python 2 this commit will change the behaviour.  Previously by
> calling PyUnicode_Decode we would have been returning a Unicode
> object.  Now, after calling host_string_to_python_string, we will have
> a str object.
> 
> I've checked the GDB documentation, and, as far as I can tell, the
> methods I've touched were all documented as returning a string, or in
> the gdb.Command case, take a string as an argument, and my
> understanding is that for Python 2, string generally means str.  So I
> think the new behaviour would be more expected.
> 
> A different solution, that would also make things more consistent in
> the Python 2 world, would be to have host_string_to_python_string
> always return a Unicode object.  However, I've been reading this page:
> 
>   https://pythonhosted.org/kitchen/unicode-frustrations.html
> 
> item #3 recommends that unicode strings should be converted to str
> objects before being printed (in Python 2).  That would mean that
> users should have been adding .encode() calls to the output of the
> routines I've changed in this commit (if they wanted to print that
> output), which is not something I think is made clear from the GDB
> documentation.

I don't think we can (nor want) to change things currently returning
unicode objects to return str objects.  It changes significantly what
the user code receives.  If a given method/function has been returning a
unicode object since the dawn of time and we now return a string object,
user code expecting a unicode object will break.

Also, changing a PyUnicode_Decode call to host_string_to_python_string
(and therefore PyString_Decode) means that if the string contains
something not encodable in ascii, things will now fail  So that sounds
like a regression to me: where we could handle UTF-8 strings before, we
don't now.  See below for examples.

So my suggestion is to leave things as-is, avoid changing the behavior
for Python 2 until we finally remove Python 2 support (for which anybody
willing to do it has my automatic +1).

> ---
>  gdb/python/py-cmd.c          |  9 +++------
>  gdb/python/py-frame.c        |  5 ++---
>  gdb/python/py-type.c         |  3 +--
>  gdb/python/py-utils.c        |  5 ++---
>  gdb/python/py-value.c        |  4 ++--
>  gdb/python/python-internal.h | 11 ++++++++++-
>  gdb/python/python.c          |  4 ++--
>  7 files changed, 22 insertions(+), 19 deletions(-)
> 
> diff --git a/gdb/python/py-cmd.c b/gdb/python/py-cmd.c
> index 94608e0bbcf..b2cafeba320 100644
> --- a/gdb/python/py-cmd.c
> +++ b/gdb/python/py-cmd.c
> @@ -120,8 +120,7 @@ cmdpy_function (const char *args, int from_tty, cmd_list_element *command)
>  
>    if (! args)
>      args = "";
> -  gdbpy_ref<> argobj (PyUnicode_Decode (args, strlen (args), host_charset (),
> -					NULL));
> +  gdbpy_ref<> argobj = host_string_to_python_string (args);
>    if (argobj == NULL)
>      {
>        gdbpy_print_stack ();

Try with this in "script.py", where a user correctly called encode (as
you mentioned in the commit log) before printing args, a 'unicode'
object:

---
class MyCmd(gdb.Command):
    def __init__(self):
        super(MyCmd, self).__init__("my-cmd", gdb.COMMAND_USER)

    def invoke(self, args, from_tty):
        print(args.encode('utf-8'))

MyCmd()
---

$ ./gdb-old -q -nx --data-directory=data-directory -x script.py -ex 'my-cmd lél' -batch
lél
$ ./gdb-new -q -nx --data-directory=data-directory -x script.py -ex 'my-cmd lél' -batch
Python Exception <type 'exceptions.UnicodeEncodeError'>: 'ascii' codec can't encode character u'\xe9' in position 1: ordinal not in range(128)
Could not convert arguments to Python string.

> @@ -181,8 +180,7 @@ cmdpy_completer_helper (struct cmd_list_element *command,
>        return NULL;
>      }
>  
> -  gdbpy_ref<> textobj (PyUnicode_Decode (text, strlen (text), host_charset (),
> -					 NULL));
> +  gdbpy_ref<> textobj = host_string_to_python_string (text);
>    if (textobj == NULL)
>      error (_("Could not convert argument to Python string."));
>  
> @@ -194,8 +192,7 @@ cmdpy_completer_helper (struct cmd_list_element *command,
>      }
>    else
>      {
> -      wordobj.reset (PyUnicode_Decode (word, strlen (word), host_charset (),
> -				       NULL));
> +      wordobj = host_string_to_python_string (word);
>        if (wordobj == NULL)
>  	error (_("Could not convert argument to Python string."));
>      }
> diff --git a/gdb/python/py-frame.c b/gdb/python/py-frame.c
> index ee57eb10576..b507ff0794f 100644
> --- a/gdb/python/py-frame.c
> +++ b/gdb/python/py-frame.c
> @@ -131,8 +131,7 @@ frapy_name (PyObject *self, PyObject *args)
>  
>    if (name)
>      {
> -      result = PyUnicode_Decode (name.get (), strlen (name.get ()),
> -				 host_charset (), NULL);
> +      result = host_string_to_python_string (name.get ()).release ();

Same here, you can have a frame with a non-ascii name, that would now
break.

>      }
>    else
>      {
> @@ -658,7 +657,7 @@ gdbpy_frame_stop_reason_string (PyObject *self, PyObject *args)
>      }
>  
>    str = unwind_stop_reason_to_string ((enum unwind_stop_reason) reason);
> -  return PyUnicode_Decode (str, strlen (str), host_charset (), NULL);
> +  return host_string_to_python_string (str).release ();

Stop reasons in english probably don't contain non-ascii characters, but
they are translated, so could, in theory, contain non-ascii characters.
So that would break here.


	$ ./gdb-old -nx -q --data-directory=data-directory a.out -ex start -ex s -ex 'python print(gdb.selected_frame().name().encode("utf-8"))'
	Reading symbols from a.out...
	Temporary breakpoint 1 at 0x1124: file test.cpp, line 7.
	Starting program: /home/simark/build/binutils-gdb-python2/gdb/a.out 

	Temporary breakpoint 1, main () at test.cpp:7
	7         lél();
	lél () at test.cpp:3
	3       }
	lél
	(gdb) 


	$ ./gdb-new -nx -q --data-directory=data-directory a.out -ex start -ex s -ex 'python print(gdb.selected_frame().name().encode("utf-8"))'
	Reading symbols from a.out...
	Temporary breakpoint 1 at 0x1124: file test.cpp, line 7.
	Starting program: /home/simark/build/binutils-gdb-python2/gdb/a.out 

	Temporary breakpoint 1, main () at test.cpp:7
	7         lél();
	lél () at test.cpp:3
	3       }
	Traceback (most recent call last):
	  File "<string>", line 1, in <module>
	UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 1: ordinal not in range(128)
	Error while executing Python code.

>  }
>  
>  /* Implements the equality comparison for Frame objects.
> diff --git a/gdb/python/py-type.c b/gdb/python/py-type.c
> index 8b17b70fbe3..a178c6a4ab2 100644
> --- a/gdb/python/py-type.c
> +++ b/gdb/python/py-type.c
> @@ -1033,8 +1033,7 @@ typy_str (PyObject *self)
>        GDB_PY_HANDLE_EXCEPTION (except);
>      }
>  
> -  return PyUnicode_Decode (thetype.c_str (), thetype.size (),
> -			   host_charset (), NULL);
> +  return host_string_to_python_string (thetype).release ();

Here, if the type name contains some non ascii characters, old GDB
would still break, as Python will try to call str() on the unicode
object, which (I guess) will try to encode it as ascii.  With new GDB,
the same will happen, but earlier, inside host_string_to_python_string.
If somebody called Type.__str__ directly, they would see the difference
though.

>  }
>  
>  /* Implement the richcompare method.  */
> diff --git a/gdb/python/py-utils.c b/gdb/python/py-utils.c
> index 10c4173efcd..2eb1ed2a09e 100644
> --- a/gdb/python/py-utils.c
> +++ b/gdb/python/py-utils.c
> @@ -152,10 +152,9 @@ python_string_to_host_string (PyObject *obj)
>  /* Convert a host string to a python string.  */
>  
>  gdbpy_ref<>
> -host_string_to_python_string (const char *str)
> +host_string_to_python_string (const char *str, size_t len)
>  {
> -  return gdbpy_ref<> (PyString_Decode (str, strlen (str), host_charset (),
> -				       NULL));
> +  return gdbpy_ref<> (PyString_Decode (str, len, host_charset (), nullptr));
>  }
>  
>  /* Return true if OBJ is a Python string or unicode object, false
> diff --git a/gdb/python/py-value.c b/gdb/python/py-value.c
> index c843c2c3072..8bd30729454 100644
> --- a/gdb/python/py-value.c
> +++ b/gdb/python/py-value.c
> @@ -743,7 +743,7 @@ valpy_format_string (PyObject *self, PyObject *args, PyObject *kw)
>        GDB_PY_HANDLE_EXCEPTION (except);
>      }
>  
> -  return PyUnicode_Decode (stb.c_str (), stb.size (), host_charset (), NULL);
> +  return host_string_to_python_string (stb).release ();

I guess that a value here could easily be a string with non-ascii
characters.  Also, note what the comment of the function says:

  /* Implementation of gdb.Value.format_string (...) -> string.
     Return Unicode string with value contents formatted using the
     keyword-only arguments.  */


Simon

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] gdb/python: make more use of host_string_to_python_string
  2021-12-23  2:19 ` Simon Marchi
@ 2021-12-23 11:07   ` Andrew Burgess
  2021-12-24  3:10     ` Simon Marchi
  0 siblings, 1 reply; 8+ messages in thread
From: Andrew Burgess @ 2021-12-23 11:07 UTC (permalink / raw)
  To: Simon Marchi; +Cc: gdb-patches

* Simon Marchi <simon.marchi@polymtl.ca> [2021-12-22 21:19:41 -0500]:

> 
> 
> On 2021-12-06 13:10, Andrew Burgess via Gdb-patches wrote:
> > We have a function host_string_to_python_string, which is a wrapper
> > around PyString_Decode, which, on Python 3, is an alias for
> > PyUnicode_Decode.
> > 
> > However, there are a few places where we still call PyUnicode_Decode
> > directly.
> > 
> > This commit replaces all uses of PyUnicode_Decode with calls to
> > host_string_to_python_string instead.
> > 
> > To make the use of host_string_to_python_string easier, I've added a
> > couple of overloads for this function in python-internal.h, these all
> > just forward their calls onto the single base implementation.  The
> > signatures of all host_string_to_python_string overloads are:
> > 
> >   gdbpy_ref<> host_string_to_python_string (const char *str, size_t len);
> >   gdbpy_ref<> host_string_to_python_string (const char *str)
> >   gdbpy_ref<> host_string_to_python_string (const string_file &str)
> > 
> > For Python 3 PyString_Decode is setup (in python-internal.h) as an
> > alias for PyUnicode_Decode, so there should certainly be no user
> > visible changes in this case.
> > 
> > For Python 2 this commit will change the behaviour.  Previously by
> > calling PyUnicode_Decode we would have been returning a Unicode
> > object.  Now, after calling host_string_to_python_string, we will have
> > a str object.
> > 
> > I've checked the GDB documentation, and, as far as I can tell, the
> > methods I've touched were all documented as returning a string, or in
> > the gdb.Command case, take a string as an argument, and my
> > understanding is that for Python 2, string generally means str.  So I
> > think the new behaviour would be more expected.
> > 
> > A different solution, that would also make things more consistent in
> > the Python 2 world, would be to have host_string_to_python_string
> > always return a Unicode object.  However, I've been reading this page:
> > 
> >   https://pythonhosted.org/kitchen/unicode-frustrations.html
> > 
> > item #3 recommends that unicode strings should be converted to str
> > objects before being printed (in Python 2).  That would mean that
> > users should have been adding .encode() calls to the output of the
> > routines I've changed in this commit (if they wanted to print that
> > output), which is not something I think is made clear from the GDB
> > documentation.
> 
> I don't think we can (nor want) to change things currently returning
> unicode objects to return str objects.  It changes significantly what
> the user code receives.  If a given method/function has been returning a
> unicode object since the dawn of time and we now return a string object,
> user code expecting a unicode object will break.

ACK.  I think I disagree about the significance of the change, but I
understand your concerns.

> 
> Also, changing a PyUnicode_Decode call to host_string_to_python_string
> (and therefore PyString_Decode) means that if the string contains
> something not encodable in ascii, things will now fail  So that sounds
> like a regression to me: where we could handle UTF-8 strings before, we
> don't now.  See below for examples.

Yeah, you're absolutely correct.  I don't understand why we would want
to call PyString_Decode here for Py2 at all, maybe you understand why
that's a good thing?

If I rewrite the function like this:

  /* Convert a host string to a python string.  */

  gdbpy_ref<>
  host_string_to_python_string (const char *str, size_t len)
  {
  #ifdef IS_PY3K
    return gdbpy_ref<> (PyString_Decode (str, len, host_charset (), nullptr));
  #else
    return gdbpy_ref<> (PyString_FromStringAndSize(str, len));
  #endif
  }

Then I think this would achieve everything I actually intended this
patch to achieve.  Obviously, this still doesn't address your concern
about the unicode -> str change (for Py2), so I doubt you'd now find
the patch acceptable.

That said, I suspect changing host_string_to_python_string as in the
above would probably be a good thing, right?  The above function is
used, so all I need to do is inject some non ASCII characters into a
code path that calls the above function and the existing GDB will
break, but the above change would allow things to work correctly.

> 
> So my suggestion is to leave things as-is, avoid changing the behavior
> for Python 2 until we finally remove Python 2 support (for which anybody
> willing to do it has my automatic +1).

I certainly wouldn't want to stop someone removing Py2 support.

> 
> > ---
> >  gdb/python/py-cmd.c          |  9 +++------
> >  gdb/python/py-frame.c        |  5 ++---
> >  gdb/python/py-type.c         |  3 +--
> >  gdb/python/py-utils.c        |  5 ++---
> >  gdb/python/py-value.c        |  4 ++--
> >  gdb/python/python-internal.h | 11 ++++++++++-
> >  gdb/python/python.c          |  4 ++--
> >  7 files changed, 22 insertions(+), 19 deletions(-)
> > 
> > diff --git a/gdb/python/py-cmd.c b/gdb/python/py-cmd.c
> > index 94608e0bbcf..b2cafeba320 100644
> > --- a/gdb/python/py-cmd.c
> > +++ b/gdb/python/py-cmd.c
> > @@ -120,8 +120,7 @@ cmdpy_function (const char *args, int from_tty, cmd_list_element *command)
> >  
> >    if (! args)
> >      args = "";
> > -  gdbpy_ref<> argobj (PyUnicode_Decode (args, strlen (args), host_charset (),
> > -					NULL));
> > +  gdbpy_ref<> argobj = host_string_to_python_string (args);
> >    if (argobj == NULL)
> >      {
> >        gdbpy_print_stack ();
> 
> Try with this in "script.py", where a user correctly called encode (as
> you mentioned in the commit log) before printing args, a 'unicode'
> object:
> 
> ---
> class MyCmd(gdb.Command):
>     def __init__(self):
>         super(MyCmd, self).__init__("my-cmd", gdb.COMMAND_USER)
> 
>     def invoke(self, args, from_tty):
>         print(args.encode('utf-8'))

Really this should be:

  def invoke(self, args, from_tty):
      print(args.encode(gdb.host_charset ()))

Except we don't have a gdb.host_charset method, otherwise it should be
possible for this code to go wrong.  I'll write a patch to add that.

I assume you'll not object if I propose updating the documentation for
all the functions I tried to change here to document the actual
behaviour?

Thanks for your in depth analysis.

Andrew

> 
> MyCmd()
> ---
> 
> $ ./gdb-old -q -nx --data-directory=data-directory -x script.py -ex 'my-cmd lél' -batch
> lél
> $ ./gdb-new -q -nx --data-directory=data-directory -x script.py -ex 'my-cmd lél' -batch
> Python Exception <type 'exceptions.UnicodeEncodeError'>: 'ascii' codec can't encode character u'\xe9' in position 1: ordinal not in range(128)
> Could not convert arguments to Python string.
> 
> > @@ -181,8 +180,7 @@ cmdpy_completer_helper (struct cmd_list_element *command,
> >        return NULL;
> >      }
> >  
> > -  gdbpy_ref<> textobj (PyUnicode_Decode (text, strlen (text), host_charset (),
> > -					 NULL));
> > +  gdbpy_ref<> textobj = host_string_to_python_string (text);
> >    if (textobj == NULL)
> >      error (_("Could not convert argument to Python string."));
> >  
> > @@ -194,8 +192,7 @@ cmdpy_completer_helper (struct cmd_list_element *command,
> >      }
> >    else
> >      {
> > -      wordobj.reset (PyUnicode_Decode (word, strlen (word), host_charset (),
> > -				       NULL));
> > +      wordobj = host_string_to_python_string (word);
> >        if (wordobj == NULL)
> >  	error (_("Could not convert argument to Python string."));
> >      }
> > diff --git a/gdb/python/py-frame.c b/gdb/python/py-frame.c
> > index ee57eb10576..b507ff0794f 100644
> > --- a/gdb/python/py-frame.c
> > +++ b/gdb/python/py-frame.c
> > @@ -131,8 +131,7 @@ frapy_name (PyObject *self, PyObject *args)
> >  
> >    if (name)
> >      {
> > -      result = PyUnicode_Decode (name.get (), strlen (name.get ()),
> > -				 host_charset (), NULL);
> > +      result = host_string_to_python_string (name.get ()).release ();
> 
> Same here, you can have a frame with a non-ascii name, that would now
> break.
> 
> >      }
> >    else
> >      {
> > @@ -658,7 +657,7 @@ gdbpy_frame_stop_reason_string (PyObject *self, PyObject *args)
> >      }
> >  
> >    str = unwind_stop_reason_to_string ((enum unwind_stop_reason) reason);
> > -  return PyUnicode_Decode (str, strlen (str), host_charset (), NULL);
> > +  return host_string_to_python_string (str).release ();
> 
> Stop reasons in english probably don't contain non-ascii characters, but
> they are translated, so could, in theory, contain non-ascii characters.
> So that would break here.
> 
> 
> 	$ ./gdb-old -nx -q --data-directory=data-directory a.out -ex start -ex s -ex 'python print(gdb.selected_frame().name().encode("utf-8"))'
> 	Reading symbols from a.out...
> 	Temporary breakpoint 1 at 0x1124: file test.cpp, line 7.
> 	Starting program: /home/simark/build/binutils-gdb-python2/gdb/a.out 
> 
> 	Temporary breakpoint 1, main () at test.cpp:7
> 	7         lél();
> 	lél () at test.cpp:3
> 	3       }
> 	lél
> 	(gdb) 
> 
> 
> 	$ ./gdb-new -nx -q --data-directory=data-directory a.out -ex start -ex s -ex 'python print(gdb.selected_frame().name().encode("utf-8"))'
> 	Reading symbols from a.out...
> 	Temporary breakpoint 1 at 0x1124: file test.cpp, line 7.
> 	Starting program: /home/simark/build/binutils-gdb-python2/gdb/a.out 
> 
> 	Temporary breakpoint 1, main () at test.cpp:7
> 	7         lél();
> 	lél () at test.cpp:3
> 	3       }
> 	Traceback (most recent call last):
> 	  File "<string>", line 1, in <module>
> 	UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 1: ordinal not in range(128)
> 	Error while executing Python code.
> 
> >  }
> >  
> >  /* Implements the equality comparison for Frame objects.
> > diff --git a/gdb/python/py-type.c b/gdb/python/py-type.c
> > index 8b17b70fbe3..a178c6a4ab2 100644
> > --- a/gdb/python/py-type.c
> > +++ b/gdb/python/py-type.c
> > @@ -1033,8 +1033,7 @@ typy_str (PyObject *self)
> >        GDB_PY_HANDLE_EXCEPTION (except);
> >      }
> >  
> > -  return PyUnicode_Decode (thetype.c_str (), thetype.size (),
> > -			   host_charset (), NULL);
> > +  return host_string_to_python_string (thetype).release ();
> 
> Here, if the type name contains some non ascii characters, old GDB
> would still break, as Python will try to call str() on the unicode
> object, which (I guess) will try to encode it as ascii.  With new GDB,
> the same will happen, but earlier, inside host_string_to_python_string.
> If somebody called Type.__str__ directly, they would see the difference
> though.
> 
> >  }
> >  
> >  /* Implement the richcompare method.  */
> > diff --git a/gdb/python/py-utils.c b/gdb/python/py-utils.c
> > index 10c4173efcd..2eb1ed2a09e 100644
> > --- a/gdb/python/py-utils.c
> > +++ b/gdb/python/py-utils.c
> > @@ -152,10 +152,9 @@ python_string_to_host_string (PyObject *obj)
> >  /* Convert a host string to a python string.  */
> >  
> >  gdbpy_ref<>
> > -host_string_to_python_string (const char *str)
> > +host_string_to_python_string (const char *str, size_t len)
> >  {
> > -  return gdbpy_ref<> (PyString_Decode (str, strlen (str), host_charset (),
> > -				       NULL));
> > +  return gdbpy_ref<> (PyString_Decode (str, len, host_charset (), nullptr));
> >  }
> >  
> >  /* Return true if OBJ is a Python string or unicode object, false
> > diff --git a/gdb/python/py-value.c b/gdb/python/py-value.c
> > index c843c2c3072..8bd30729454 100644
> > --- a/gdb/python/py-value.c
> > +++ b/gdb/python/py-value.c
> > @@ -743,7 +743,7 @@ valpy_format_string (PyObject *self, PyObject *args, PyObject *kw)
> >        GDB_PY_HANDLE_EXCEPTION (except);
> >      }
> >  
> > -  return PyUnicode_Decode (stb.c_str (), stb.size (), host_charset (), NULL);
> > +  return host_string_to_python_string (stb).release ();
> 
> I guess that a value here could easily be a string with non-ascii
> characters.  Also, note what the comment of the function says:
> 
>   /* Implementation of gdb.Value.format_string (...) -> string.
>      Return Unicode string with value contents formatted using the
>      keyword-only arguments.  */
> 
> 
> Simon
> 


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] gdb/python: make more use of host_string_to_python_string
  2021-12-06 18:10 [PATCH] gdb/python: make more use of host_string_to_python_string Andrew Burgess
  2021-12-22 15:08 ` PING: " Andrew Burgess
  2021-12-23  2:19 ` Simon Marchi
@ 2021-12-23 11:35 ` Andrew Burgess
  2 siblings, 0 replies; 8+ messages in thread
From: Andrew Burgess @ 2021-12-23 11:35 UTC (permalink / raw)
  To: gdb-patches

* Andrew Burgess <aburgess@redhat.com> [2021-12-06 18:10:30 +0000]:

> We have a function host_string_to_python_string, which is a wrapper
> around PyString_Decode, which, on Python 3, is an alias for
> PyUnicode_Decode.
> 
> However, there are a few places where we still call PyUnicode_Decode
> directly.
> 
> This commit replaces all uses of PyUnicode_Decode with calls to
> host_string_to_python_string instead.
> 
> To make the use of host_string_to_python_string easier, I've added a
> couple of overloads for this function in python-internal.h, these all
> just forward their calls onto the single base implementation.  The
> signatures of all host_string_to_python_string overloads are:
> 
>   gdbpy_ref<> host_string_to_python_string (const char *str, size_t len);
>   gdbpy_ref<> host_string_to_python_string (const char *str)
>   gdbpy_ref<> host_string_to_python_string (const string_file &str)
> 
> For Python 3 PyString_Decode is setup (in python-internal.h) as an
> alias for PyUnicode_Decode, so there should certainly be no user
> visible changes in this case.
> 
> For Python 2 this commit will change the behaviour.  Previously by
> calling PyUnicode_Decode we would have been returning a Unicode
> object.  Now, after calling host_string_to_python_string, we will have
> a str object.
> 
> I've checked the GDB documentation, and, as far as I can tell, the
> methods I've touched were all documented as returning a string, or in
> the gdb.Command case, take a string as an argument, and my
> understanding is that for Python 2, string generally means str.  So I
> think the new behaviour would be more expected.
> 
> A different solution, that would also make things more consistent in
> the Python 2 world, would be to have host_string_to_python_string
> always return a Unicode object.  However, I've been reading this page:
> 
>   https://pythonhosted.org/kitchen/unicode-frustrations.html
> 
> item #3 recommends that unicode strings should be converted to str
> objects before being printed (in Python 2).  That would mean that
> users should have been adding .encode() calls to the output of the
> routines I've changed in this commit (if they wanted to print that
> output), which is not something I think is made clear from the GDB
> documentation.
> ---
>  gdb/python/py-cmd.c          |  9 +++------
>  gdb/python/py-frame.c        |  5 ++---
>  gdb/python/py-type.c         |  3 +--
>  gdb/python/py-utils.c        |  5 ++---
>  gdb/python/py-value.c        |  4 ++--
>  gdb/python/python-internal.h | 11 ++++++++++-
>  gdb/python/python.c          |  4 ++--
>  7 files changed, 22 insertions(+), 19 deletions(-)
> 
> diff --git a/gdb/python/py-cmd.c b/gdb/python/py-cmd.c
> index 94608e0bbcf..b2cafeba320 100644
> --- a/gdb/python/py-cmd.c
> +++ b/gdb/python/py-cmd.c
> @@ -120,8 +120,7 @@ cmdpy_function (const char *args, int from_tty, cmd_list_element *command)
>  
>    if (! args)
>      args = "";
> -  gdbpy_ref<> argobj (PyUnicode_Decode (args, strlen (args), host_charset (),
> -					NULL));
> +  gdbpy_ref<> argobj = host_string_to_python_string (args);
>    if (argobj == NULL)
>      {
>        gdbpy_print_stack ();
> @@ -181,8 +180,7 @@ cmdpy_completer_helper (struct cmd_list_element *command,
>        return NULL;
>      }
>  
> -  gdbpy_ref<> textobj (PyUnicode_Decode (text, strlen (text), host_charset (),
> -					 NULL));
> +  gdbpy_ref<> textobj = host_string_to_python_string (text);
>    if (textobj == NULL)
>      error (_("Could not convert argument to Python string."));
>  
> @@ -194,8 +192,7 @@ cmdpy_completer_helper (struct cmd_list_element *command,
>      }
>    else
>      {
> -      wordobj.reset (PyUnicode_Decode (word, strlen (word), host_charset (),
> -				       NULL));
> +      wordobj = host_string_to_python_string (word);
>        if (wordobj == NULL)
>  	error (_("Could not convert argument to Python string."));
>      }
> diff --git a/gdb/python/py-frame.c b/gdb/python/py-frame.c
> index ee57eb10576..b507ff0794f 100644
> --- a/gdb/python/py-frame.c
> +++ b/gdb/python/py-frame.c
> @@ -131,8 +131,7 @@ frapy_name (PyObject *self, PyObject *args)
>  
>    if (name)
>      {
> -      result = PyUnicode_Decode (name.get (), strlen (name.get ()),
> -				 host_charset (), NULL);
> +      result = host_string_to_python_string (name.get ()).release ();
>      }
>    else
>      {
> @@ -658,7 +657,7 @@ gdbpy_frame_stop_reason_string (PyObject *self, PyObject *args)
>      }
>  
>    str = unwind_stop_reason_to_string ((enum unwind_stop_reason) reason);
> -  return PyUnicode_Decode (str, strlen (str), host_charset (), NULL);
> +  return host_string_to_python_string (str).release ();
>  }
>  
>  /* Implements the equality comparison for Frame objects.
> diff --git a/gdb/python/py-type.c b/gdb/python/py-type.c
> index 8b17b70fbe3..a178c6a4ab2 100644
> --- a/gdb/python/py-type.c
> +++ b/gdb/python/py-type.c
> @@ -1033,8 +1033,7 @@ typy_str (PyObject *self)
>        GDB_PY_HANDLE_EXCEPTION (except);
>      }
>  
> -  return PyUnicode_Decode (thetype.c_str (), thetype.size (),
> -			   host_charset (), NULL);
> +  return host_string_to_python_string (thetype).release ();
>  }
>  
>  /* Implement the richcompare method.  */
> diff --git a/gdb/python/py-utils.c b/gdb/python/py-utils.c
> index 10c4173efcd..2eb1ed2a09e 100644
> --- a/gdb/python/py-utils.c
> +++ b/gdb/python/py-utils.c
> @@ -152,10 +152,9 @@ python_string_to_host_string (PyObject *obj)
>  /* Convert a host string to a python string.  */
>  
>  gdbpy_ref<>
> -host_string_to_python_string (const char *str)
> +host_string_to_python_string (const char *str, size_t len)
>  {
> -  return gdbpy_ref<> (PyString_Decode (str, strlen (str), host_charset (),
> -				       NULL));
> +  return gdbpy_ref<> (PyString_Decode (str, len, host_charset (), nullptr));
>  }
>  
>  /* Return true if OBJ is a Python string or unicode object, false
> diff --git a/gdb/python/py-value.c b/gdb/python/py-value.c
> index c843c2c3072..8bd30729454 100644
> --- a/gdb/python/py-value.c
> +++ b/gdb/python/py-value.c
> @@ -743,7 +743,7 @@ valpy_format_string (PyObject *self, PyObject *args, PyObject *kw)
>        GDB_PY_HANDLE_EXCEPTION (except);
>      }
>  
> -  return PyUnicode_Decode (stb.c_str (), stb.size (), host_charset (), NULL);
> +  return host_string_to_python_string (stb).release ();
>  }
>  
>  /* A helper function that implements the various cast operators.  */
> @@ -1149,7 +1149,7 @@ valpy_str (PyObject *self)
>        GDB_PY_HANDLE_EXCEPTION (except);
>      }
>  
> -  return PyUnicode_Decode (stb.c_str (), stb.size (), host_charset (), NULL);
> +  return host_string_to_python_string (stb).release ();
>  }
>  
>  /* Implements gdb.Value.is_optimized_out.  */
> diff --git a/gdb/python/python-internal.h b/gdb/python/python-internal.h
> index 211833e4b2d..195cb0a1896 100644
> --- a/gdb/python/python-internal.h
> +++ b/gdb/python/python-internal.h
> @@ -22,6 +22,7 @@
>  
>  #include "extension.h"
>  #include "extension-priv.h"
> +#include "ui-file.h"
>  
>  /* These WITH_* macros are defined by the CPython API checker that
>     comes with the Python plugin for GCC.  See:
> @@ -715,7 +716,15 @@ gdb::unique_xmalloc_ptr<char> unicode_to_target_string (PyObject *unicode_str);
>  gdb::unique_xmalloc_ptr<char> python_string_to_target_string (PyObject *obj);
>  gdbpy_ref<> python_string_to_target_python_string (PyObject *obj);
>  gdb::unique_xmalloc_ptr<char> python_string_to_host_string (PyObject *obj);
> -gdbpy_ref<> host_string_to_python_string (const char *str);
> +gdbpy_ref<> host_string_to_python_string (const char *str, size_t len);
> +static inline gdbpy_ref<> host_string_to_python_string (const char *str)
> +{
> +  return host_string_to_python_string (str, strlen (str));
> +}
> +static inline gdbpy_ref<> host_string_to_python_string (const string_file &str)
> +{
> +  return host_string_to_python_string (str.c_str (), str.size ());
> +}
>  int gdbpy_is_string (PyObject *obj);
>  gdb::unique_xmalloc_ptr<char> gdbpy_obj_to_string (PyObject *obj);
>  
> diff --git a/gdb/python/python.c b/gdb/python/python.c
> index 82af012068b..6e85d30ed97 100644
> --- a/gdb/python/python.c
> +++ b/gdb/python/python.c
> @@ -558,7 +558,7 @@ gdbpy_target_charset (PyObject *self, PyObject *args)
>  {
>    const char *cset = target_charset (python_gdbarch);
>  
> -  return PyUnicode_Decode (cset, strlen (cset), host_charset (), NULL);
> +  return host_string_to_python_string (cset).release ();
>  }

I wonder if we are over using 'host_charset ()'?  Consider this
session:

  gdb) set host-charset UTF-32
  (gdb) python print(gdb.target_charset ())
  Traceback (most recent call last):
    File "<string>", line 1, in <module>
  UnicodeDecodeError: 'utf-32-le' codec can't decode bytes in position 0-3: code point not in range(0x110000)
  Error while executing Python code.
  (gdb) 

Because the string we're encoding is not in host_charset, but is an
ascii string from within GDB.  I wonder if in situations like this we
should actually be saying:

  static PyObject *
  gdbpy_target_charset (PyObject *self, PyObject *args)
  {
    const char *cset = target_charset (python_gdbarch);

    return PyUnicode_Decode (cset, strlen (cset), "utf-8", NULL);
  }

Then this function should be documented as returning a utf-8 unicode
object.

Though, having said that, I wonder if the real problem is with 'set
host-charset'?  Consider:

  (gdb) help set host-charset
  Set the host character set.
  The `host character set' is the one used by the system GDB is running on.
  You may only use supersets of ASCII for your host character set; GDB does
  not support any others.
  To see a list of the character sets GDB supports, type `set host-charset <TAB>'.

It's not clear what a superset of ASCII is, but I would have assumed
it meant a character set where the bytes for any ascii character are
unchanged in the unicode encoding, which, as I understand it, is not
the case for things like utf-32.  So maybe superset of ASCII means
something different... or maybe I'm not understanding utf-32
correctly...

Thanks,
Andrew


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] gdb/python: make more use of host_string_to_python_string
  2021-12-23 11:07   ` Andrew Burgess
@ 2021-12-24  3:10     ` Simon Marchi
  2021-12-24 12:20       ` Andrew Burgess
  0 siblings, 1 reply; 8+ messages in thread
From: Simon Marchi @ 2021-12-24  3:10 UTC (permalink / raw)
  To: Andrew Burgess; +Cc: gdb-patches



On 2021-12-23 06:07, Andrew Burgess wrote:
> * Simon Marchi <simon.marchi@polymtl.ca> [2021-12-22 21:19:41 -0500]:
> 
>>
>>
>> On 2021-12-06 13:10, Andrew Burgess via Gdb-patches wrote:
>>> We have a function host_string_to_python_string, which is a wrapper
>>> around PyString_Decode, which, on Python 3, is an alias for
>>> PyUnicode_Decode.
>>>
>>> However, there are a few places where we still call PyUnicode_Decode
>>> directly.
>>>
>>> This commit replaces all uses of PyUnicode_Decode with calls to
>>> host_string_to_python_string instead.
>>>
>>> To make the use of host_string_to_python_string easier, I've added a
>>> couple of overloads for this function in python-internal.h, these all
>>> just forward their calls onto the single base implementation.  The
>>> signatures of all host_string_to_python_string overloads are:
>>>
>>>   gdbpy_ref<> host_string_to_python_string (const char *str, size_t len);
>>>   gdbpy_ref<> host_string_to_python_string (const char *str)
>>>   gdbpy_ref<> host_string_to_python_string (const string_file &str)
>>>
>>> For Python 3 PyString_Decode is setup (in python-internal.h) as an
>>> alias for PyUnicode_Decode, so there should certainly be no user
>>> visible changes in this case.
>>>
>>> For Python 2 this commit will change the behaviour.  Previously by
>>> calling PyUnicode_Decode we would have been returning a Unicode
>>> object.  Now, after calling host_string_to_python_string, we will have
>>> a str object.
>>>
>>> I've checked the GDB documentation, and, as far as I can tell, the
>>> methods I've touched were all documented as returning a string, or in
>>> the gdb.Command case, take a string as an argument, and my
>>> understanding is that for Python 2, string generally means str.  So I
>>> think the new behaviour would be more expected.
>>>
>>> A different solution, that would also make things more consistent in
>>> the Python 2 world, would be to have host_string_to_python_string
>>> always return a Unicode object.  However, I've been reading this page:
>>>
>>>   https://pythonhosted.org/kitchen/unicode-frustrations.html
>>>
>>> item #3 recommends that unicode strings should be converted to str
>>> objects before being printed (in Python 2).  That would mean that
>>> users should have been adding .encode() calls to the output of the
>>> routines I've changed in this commit (if they wanted to print that
>>> output), which is not something I think is made clear from the GDB
>>> documentation.
>>
>> I don't think we can (nor want) to change things currently returning
>> unicode objects to return str objects.  It changes significantly what
>> the user code receives.  If a given method/function has been returning a
>> unicode object since the dawn of time and we now return a string object,
>> user code expecting a unicode object will break.
> 
> ACK.  I think I disagree about the significance of the change, but I
> understand your concerns.

I fail to see how you don't find it significant: this is a public API,
we change the type of the returned value, doesn't it break existing
users?

>> Also, changing a PyUnicode_Decode call to host_string_to_python_string
>> (and therefore PyString_Decode) means that if the string contains
>> something not encodable in ascii, things will now fail  So that sounds
>> like a regression to me: where we could handle UTF-8 strings before, we
>> don't now.  See below for examples.
> 
> Yeah, you're absolutely correct.  I don't understand why we would want
> to call PyString_Decode here for Py2 at all, maybe you understand why
> that's a good thing?
> 
> If I rewrite the function like this:
> 
>   /* Convert a host string to a python string.  */
> 
>   gdbpy_ref<>
>   host_string_to_python_string (const char *str, size_t len)
>   {
>   #ifdef IS_PY3K
>     return gdbpy_ref<> (PyString_Decode (str, len, host_charset (), nullptr));
>   #else
>     return gdbpy_ref<> (PyString_FromStringAndSize(str, len));
>   #endif
>   }
> 
> Then I think this would achieve everything I actually intended this
> patch to achieve.

Can you clarify what the patch is trying to achieve?  I don't think I
understand properly.

Given that Python 2's string type is equivalent to Python 3's bytes
type, and Python 2's unicode type is equivalent to Python 3's string
type, I find it odd to have host_string_to_python_string (and some
Python API functions) return a "string" both in Python 2 and Python 3,
as they are both fundamentally different types.  Returning a unicode in
Python 2 and a string in Python 3 makes sense to me, they are basically
the same  thing.

> Obviously, this still doesn't address your concern
> about the unicode -> str change (for Py2), so I doubt you'd now find
> the patch acceptable.
> 
> That said, I suspect changing host_string_to_python_string as in the
> above would probably be a good thing, right?  The above function is
> used, so all I need to do is inject some non ASCII characters into a
> code path that calls the above function and the existing GDB will
> break, but the above change would allow things to work correctly.

Code paths that do use this function already get a "str" in both Python
2 and 3 (which I think is wrong, as explained above, but that's what we
have to deal with) and would still receive a "str" after, so the change
is is safe from that point of view.  Today, a non-ASCII character in the
input argument just causes an exception with Python 2.  With the change
you propose, it would return an "str" containing whatever bytes came in,
in the host system's encoding.  So, the new behavior seems like an
improvement at first sight.

>> So my suggestion is to leave things as-is, avoid changing the behavior
>> for Python 2 until we finally remove Python 2 support (for which anybody
>> willing to do it has my automatic +1).
> 
> I certainly wouldn't want to stop someone removing Py2 support.

I'll take a look at that, many people have suggested it, and this is an
example of how keeping Python 2 support is a burden.

>>> ---
>>>  gdb/python/py-cmd.c          |  9 +++------
>>>  gdb/python/py-frame.c        |  5 ++---
>>>  gdb/python/py-type.c         |  3 +--
>>>  gdb/python/py-utils.c        |  5 ++---
>>>  gdb/python/py-value.c        |  4 ++--
>>>  gdb/python/python-internal.h | 11 ++++++++++-
>>>  gdb/python/python.c          |  4 ++--
>>>  7 files changed, 22 insertions(+), 19 deletions(-)
>>>
>>> diff --git a/gdb/python/py-cmd.c b/gdb/python/py-cmd.c
>>> index 94608e0bbcf..b2cafeba320 100644
>>> --- a/gdb/python/py-cmd.c
>>> +++ b/gdb/python/py-cmd.c
>>> @@ -120,8 +120,7 @@ cmdpy_function (const char *args, int from_tty, cmd_list_element *command)
>>>  
>>>    if (! args)
>>>      args = "";
>>> -  gdbpy_ref<> argobj (PyUnicode_Decode (args, strlen (args), host_charset (),
>>> -					NULL));
>>> +  gdbpy_ref<> argobj = host_string_to_python_string (args);
>>>    if (argobj == NULL)
>>>      {
>>>        gdbpy_print_stack ();
>>
>> Try with this in "script.py", where a user correctly called encode (as
>> you mentioned in the commit log) before printing args, a 'unicode'
>> object:
>>
>> ---
>> class MyCmd(gdb.Command):
>>     def __init__(self):
>>         super(MyCmd, self).__init__("my-cmd", gdb.COMMAND_USER)
>>
>>     def invoke(self, args, from_tty):
>>         print(args.encode('utf-8'))
> 
> Really this should be:
> 
>   def invoke(self, args, from_tty):
>       print(args.encode(gdb.host_charset ()))
> 
> Except we don't have a gdb.host_charset method, otherwise it should be
> possible for this code to go wrong.

True.  Although somebody could still use .encode('utf-8') and just use
that script on machines where UTF-8 is the locale (which is just the
norm today).

> I'll write a patch to add that.
> 
> I assume you'll not object if I propose updating the documentation for
> all the functions I tried to change here to document the actual
> behaviour?

Sure.  Although if we end up removing Python 2 support (which is not a
given), it might be unnecessary.

Simon

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] gdb/python: make more use of host_string_to_python_string
  2021-12-24  3:10     ` Simon Marchi
@ 2021-12-24 12:20       ` Andrew Burgess
  2021-12-27  1:30         ` Simon Marchi
  0 siblings, 1 reply; 8+ messages in thread
From: Andrew Burgess @ 2021-12-24 12:20 UTC (permalink / raw)
  To: Simon Marchi; +Cc: gdb-patches

* Simon Marchi <simon.marchi@polymtl.ca> [2021-12-23 22:10:00 -0500]:

> 
> 
> On 2021-12-23 06:07, Andrew Burgess wrote:
> > * Simon Marchi <simon.marchi@polymtl.ca> [2021-12-22 21:19:41 -0500]:
> > 
> >>
> >>
> >> On 2021-12-06 13:10, Andrew Burgess via Gdb-patches wrote:
> >>> We have a function host_string_to_python_string, which is a wrapper
> >>> around PyString_Decode, which, on Python 3, is an alias for
> >>> PyUnicode_Decode.
> >>>
> >>> However, there are a few places where we still call PyUnicode_Decode
> >>> directly.
> >>>
> >>> This commit replaces all uses of PyUnicode_Decode with calls to
> >>> host_string_to_python_string instead.
> >>>
> >>> To make the use of host_string_to_python_string easier, I've added a
> >>> couple of overloads for this function in python-internal.h, these all
> >>> just forward their calls onto the single base implementation.  The
> >>> signatures of all host_string_to_python_string overloads are:
> >>>
> >>>   gdbpy_ref<> host_string_to_python_string (const char *str, size_t len);
> >>>   gdbpy_ref<> host_string_to_python_string (const char *str)
> >>>   gdbpy_ref<> host_string_to_python_string (const string_file &str)
> >>>
> >>> For Python 3 PyString_Decode is setup (in python-internal.h) as an
> >>> alias for PyUnicode_Decode, so there should certainly be no user
> >>> visible changes in this case.
> >>>
> >>> For Python 2 this commit will change the behaviour.  Previously by
> >>> calling PyUnicode_Decode we would have been returning a Unicode
> >>> object.  Now, after calling host_string_to_python_string, we will have
> >>> a str object.
> >>>
> >>> I've checked the GDB documentation, and, as far as I can tell, the
> >>> methods I've touched were all documented as returning a string, or in
> >>> the gdb.Command case, take a string as an argument, and my
> >>> understanding is that for Python 2, string generally means str.  So I
> >>> think the new behaviour would be more expected.
> >>>
> >>> A different solution, that would also make things more consistent in
> >>> the Python 2 world, would be to have host_string_to_python_string
> >>> always return a Unicode object.  However, I've been reading this page:
> >>>
> >>>   https://pythonhosted.org/kitchen/unicode-frustrations.html
> >>>
> >>> item #3 recommends that unicode strings should be converted to str
> >>> objects before being printed (in Python 2).  That would mean that
> >>> users should have been adding .encode() calls to the output of the
> >>> routines I've changed in this commit (if they wanted to print that
> >>> output), which is not something I think is made clear from the GDB
> >>> documentation.
> >>
> >> I don't think we can (nor want) to change things currently returning
> >> unicode objects to return str objects.  It changes significantly what
> >> the user code receives.  If a given method/function has been returning a
> >> unicode object since the dawn of time and we now return a string object,
> >> user code expecting a unicode object will break.
> > 
> > ACK.  I think I disagree about the significance of the change, but I
> > understand your concerns.
> 
> I fail to see how you don't find it significant: this is a public API,
> we change the type of the returned value, doesn't it break existing
> users?

I have two reasons, (1) the behaviour doesn't match the documented
API, which I think gives some scope for fixing either the
implementation or the documentation, and (2) in reality, for many
purposes, a unicode can be used just like a str, e.g. I guess most
folk are not calling encode before printing the unicode object, they
just pass them to print.

Please do remember though, I say the above just to give you insight to
my thinking, I've already ACK'd the above as a blocker for the patch I
proposed.

>
> >> Also, changing a PyUnicode_Decode call to host_string_to_python_string
> >> (and therefore PyString_Decode) means that if the string contains
> >> something not encodable in ascii, things will now fail  So that sounds
> >> like a regression to me: where we could handle UTF-8 strings before, we
> >> don't now.  See below for examples.
> > 
> > Yeah, you're absolutely correct.  I don't understand why we would want
> > to call PyString_Decode here for Py2 at all, maybe you understand why
> > that's a good thing?
> > 
> > If I rewrite the function like this:
> > 
> >   /* Convert a host string to a python string.  */
> > 
> >   gdbpy_ref<>
> >   host_string_to_python_string (const char *str, size_t len)
> >   {
> >   #ifdef IS_PY3K
> >     return gdbpy_ref<> (PyString_Decode (str, len, host_charset (), nullptr));
> >   #else
> >     return gdbpy_ref<> (PyString_FromStringAndSize(str, len));
> >   #endif
> >   }
> > 
> > Then I think this would achieve everything I actually intended this
> > patch to achieve.
> 
> Can you clarify what the patch is trying to achieve?  I don't think I
> understand properly.

I started working on a slightly related issue, and as I started
digging around I was running into places where the API docs said
function XXX returns a string, which isn't the case on Py2.  That got
me looking the places where we called PyUnicode_Decode directly rather
than passing through host_string_to_python_string, and, well, here we
are...

> 
> Given that Python 2's string type is equivalent to Python 3's bytes
> type, and Python 2's unicode type is equivalent to Python 3's string
> type, I find it odd to have host_string_to_python_string (and some
> Python API functions) return a "string" both in Python 2 and Python 3,
> as they are both fundamentally different types.

I understand you position.

>                                                   Returning a unicode in
> Python 2 and a string in Python 3 makes sense to me, they are basically
> the same  thing.

I thought in Py3 unicode and str were the same, so its less
"basically the same thing", and more "are the same thing", right?

> 
> > Obviously, this still doesn't address your concern
> > about the unicode -> str change (for Py2), so I doubt you'd now find
> > the patch acceptable.
> > 
> > That said, I suspect changing host_string_to_python_string as in the
> > above would probably be a good thing, right?  The above function is
> > used, so all I need to do is inject some non ASCII characters into a
> > code path that calls the above function and the existing GDB will
> > break, but the above change would allow things to work correctly.
> 
> Code paths that do use this function already get a "str" in both Python
> 2 and 3 (which I think is wrong, as explained above, but that's what we
> have to deal with) and would still receive a "str" after, so the change
> is is safe from that point of view.

I understand why, given the view that host_string_to_python_string is
basically wrong, adding any additional calls to it would be considered
wrong.  Maybe we should rename it to
deprecated_host_string_to_python_string, and add a new function,
host_string_to_python_unicode.

If/when Py2 support is dropped then user of the old function could be
changed to use the new *_unicode function?

>                                       Today, a non-ASCII character in the
> input argument just causes an exception with Python 2.  With the change
> you propose, it would return an "str" containing whatever bytes came in,
> in the host system's encoding.  So, the new behavior seems like an
> improvement at first sight.

OK, thanks for the feedback.  I see what I can put together in the new
year.

> 
> >> So my suggestion is to leave things as-is, avoid changing the behavior
> >> for Python 2 until we finally remove Python 2 support (for which anybody
> >> willing to do it has my automatic +1).
> > 
> > I certainly wouldn't want to stop someone removing Py2 support.
> 
> I'll take a look at that, many people have suggested it, and this is an
> example of how keeping Python 2 support is a burden.
> 
> >>> ---
> >>>  gdb/python/py-cmd.c          |  9 +++------
> >>>  gdb/python/py-frame.c        |  5 ++---
> >>>  gdb/python/py-type.c         |  3 +--
> >>>  gdb/python/py-utils.c        |  5 ++---
> >>>  gdb/python/py-value.c        |  4 ++--
> >>>  gdb/python/python-internal.h | 11 ++++++++++-
> >>>  gdb/python/python.c          |  4 ++--
> >>>  7 files changed, 22 insertions(+), 19 deletions(-)
> >>>
> >>> diff --git a/gdb/python/py-cmd.c b/gdb/python/py-cmd.c
> >>> index 94608e0bbcf..b2cafeba320 100644
> >>> --- a/gdb/python/py-cmd.c
> >>> +++ b/gdb/python/py-cmd.c
> >>> @@ -120,8 +120,7 @@ cmdpy_function (const char *args, int from_tty, cmd_list_element *command)
> >>>  
> >>>    if (! args)
> >>>      args = "";
> >>> -  gdbpy_ref<> argobj (PyUnicode_Decode (args, strlen (args), host_charset (),
> >>> -					NULL));
> >>> +  gdbpy_ref<> argobj = host_string_to_python_string (args);
> >>>    if (argobj == NULL)
> >>>      {
> >>>        gdbpy_print_stack ();
> >>
> >> Try with this in "script.py", where a user correctly called encode (as
> >> you mentioned in the commit log) before printing args, a 'unicode'
> >> object:
> >>
> >> ---
> >> class MyCmd(gdb.Command):
> >>     def __init__(self):
> >>         super(MyCmd, self).__init__("my-cmd", gdb.COMMAND_USER)
> >>
> >>     def invoke(self, args, from_tty):
> >>         print(args.encode('utf-8'))
> > 
> > Really this should be:
> > 
> >   def invoke(self, args, from_tty):
> >       print(args.encode(gdb.host_charset ()))
> > 
> > Except we don't have a gdb.host_charset method, otherwise it should be
> > possible for this code to go wrong.
> 
> True.  Although somebody could still use .encode('utf-8') and just use
> that script on machines where UTF-8 is the locale (which is just the
> norm today).

I don't understand what you mean here.  If the user is running on a
machine with non utf-8 locale, then (if I understand how this all
works), the bytes read by GDB would be in the machines (host_charset)
local, these bytes would be sent over to Python, which would then
convert them to a unicode object in the host_charset locale.

Now if the user wants to get the bytes back they need to know the
correct value to pass to .encode, right?

> 
> > I'll write a patch to add that.
> > 
> > I assume you'll not object if I propose updating the documentation for
> > all the functions I tried to change here to document the actual
> > behaviour?
> 
> Sure.  Although if we end up removing Python 2 support (which is not a
> given), it might be unnecessary.

Based on the above discussion, shouldn't every API that includes a
unicode object also indicate what the encoding of that unicode object
is?  I mean, sure, users can probably figure it out in most cases,
values from the inferior, target_charset, values from the user,
host_charset, but surely a well documented API should be explicit
about these things?

Thanks,
Andrew


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] gdb/python: make more use of host_string_to_python_string
  2021-12-24 12:20       ` Andrew Burgess
@ 2021-12-27  1:30         ` Simon Marchi
  0 siblings, 0 replies; 8+ messages in thread
From: Simon Marchi @ 2021-12-27  1:30 UTC (permalink / raw)
  To: Andrew Burgess; +Cc: gdb-patches

>> Given that Python 2's string type is equivalent to Python 3's bytes
>> type, and Python 2's unicode type is equivalent to Python 3's string
>> type, I find it odd to have host_string_to_python_string (and some
>> Python API functions) return a "string" both in Python 2 and Python 3,
>> as they are both fundamentally different types.
> 
> I understand you position.
> 
>>                                                   Returning a unicode in
>> Python 2 and a string in Python 3 makes sense to me, they are basically
>> the same  thing.
> 
> I thought in Py3 unicode and str were the same, so its less
> "basically the same thing", and more "are the same thing", right?

Well, the "unicode" type doesn't exist in Py3.  But yes, my
understanding is that Py2's unicode has been renamed to str in Py3, so
we could say "they are the same thing".  But I'm not an expert in Python
internals, so there may be subtle differences I don't know about.  So I
went for the more careful formulation :).


>>> Obviously, this still doesn't address your concern
>>> about the unicode -> str change (for Py2), so I doubt you'd now find
>>> the patch acceptable.
>>>
>>> That said, I suspect changing host_string_to_python_string as in the
>>> above would probably be a good thing, right?  The above function is
>>> used, so all I need to do is inject some non ASCII characters into a
>>> code path that calls the above function and the existing GDB will
>>> break, but the above change would allow things to work correctly.
>>
>> Code paths that do use this function already get a "str" in both Python
>> 2 and 3 (which I think is wrong, as explained above, but that's what we
>> have to deal with) and would still receive a "str" after, so the change
>> is is safe from that point of view.
> 
> I understand why, given the view that host_string_to_python_string is
> basically wrong, adding any additional calls to it would be considered
> wrong.  Maybe we should rename it to
> deprecated_host_string_to_python_string, and add a new function,
> host_string_to_python_unicode.
> 
> If/when Py2 support is dropped then user of the old function could be
> changed to use the new *_unicode function?

That sounds fine.

>>>
>>> Really this should be:
>>>
>>>   def invoke(self, args, from_tty):
>>>       print(args.encode(gdb.host_charset ()))
>>>
>>> Except we don't have a gdb.host_charset method, otherwise it should be
>>> possible for this code to go wrong.
>>
>> True.  Although somebody could still use .encode('utf-8') and just use
>> that script on machines where UTF-8 is the locale (which is just the
>> norm today).
> 
> I don't understand what you mean here.  If the user is running on a
> machine with non utf-8 locale, then (if I understand how this all
> works), the bytes read by GDB would be in the machines (host_charset)
> local, these bytes would be sent over to Python, which would then
> convert them to a unicode object in the host_charset locale.
> 
> Now if the user wants to get the bytes back they need to know the
> correct value to pass to .encode, right?

Yes, you're right.  What I meant is that given that all the machines I
use have an UTF-8 locale, I could use `.encode('utf-8')` in my scripts
and just not care about other charsets.  All of this to say that there
might be scripts out there that care if they receive an str or a
unicode.

>>> I'll write a patch to add that.
>>>
>>> I assume you'll not object if I propose updating the documentation for
>>> all the functions I tried to change here to document the actual
>>> behaviour?
>>
>> Sure.  Although if we end up removing Python 2 support (which is not a
>> given), it might be unnecessary.
> 
> Based on the above discussion, shouldn't every API that includes a
> unicode object also indicate what the encoding of that unicode object
> is?  I mean, sure, users can probably figure it out in most cases,
> values from the inferior, target_charset, values from the user,
> host_charset, but surely a well documented API should be explicit
> about these things?

Not sure.  From the point of view of the user of a unicode object, a
unicode object isn't encoded using some particular encoding.  It's just
a sequence of unicode code points.  The user can then decide to
serialize these code points to the encoding of their choosing by calling
`.encode(...)`, for example 'utf-8' or 'utf-32'.

When returning a Python 2 'str', or if we happened to return text as a
Python 3 'bytes' (which we don't) , then the user just receives a
sequence of bytes.  So then it would be relevant to tell them what
encoding these are in.

Simon

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2021-12-27  1:30 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-12-06 18:10 [PATCH] gdb/python: make more use of host_string_to_python_string Andrew Burgess
2021-12-22 15:08 ` PING: " Andrew Burgess
2021-12-23  2:19 ` Simon Marchi
2021-12-23 11:07   ` Andrew Burgess
2021-12-24  3:10     ` Simon Marchi
2021-12-24 12:20       ` Andrew Burgess
2021-12-27  1:30         ` Simon Marchi
2021-12-23 11:35 ` Andrew Burgess

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).