Re: [PATCH] gdb/python: make more use of host_string_to_python_string

public inbox for gdb-patches@sourceware.org
 help / color / mirror / Atom feed

From: Simon Marchi <simon.marchi@polymtl.ca>
To: Andrew Burgess <aburgess@redhat.com>
Cc: gdb-patches@sourceware.org
Subject: Re: [PATCH] gdb/python: make more use of host_string_to_python_string
Date: Thu, 23 Dec 2021 22:10:00 -0500	[thread overview]
Message-ID: <8623b32d-697d-9b71-cf07-f073a44de1aa@polymtl.ca> (raw)
In-Reply-To: <20211223110703.GI2634@redhat.com>



On 2021-12-23 06:07, Andrew Burgess wrote:
> * Simon Marchi <simon.marchi@polymtl.ca> [2021-12-22 21:19:41 -0500]:
> 
>>
>>
>> On 2021-12-06 13:10, Andrew Burgess via Gdb-patches wrote:
>>> We have a function host_string_to_python_string, which is a wrapper
>>> around PyString_Decode, which, on Python 3, is an alias for
>>> PyUnicode_Decode.
>>>
>>> However, there are a few places where we still call PyUnicode_Decode
>>> directly.
>>>
>>> This commit replaces all uses of PyUnicode_Decode with calls to
>>> host_string_to_python_string instead.
>>>
>>> To make the use of host_string_to_python_string easier, I've added a
>>> couple of overloads for this function in python-internal.h, these all
>>> just forward their calls onto the single base implementation.  The
>>> signatures of all host_string_to_python_string overloads are:
>>>
>>>   gdbpy_ref<> host_string_to_python_string (const char *str, size_t len);
>>>   gdbpy_ref<> host_string_to_python_string (const char *str)
>>>   gdbpy_ref<> host_string_to_python_string (const string_file &str)
>>>
>>> For Python 3 PyString_Decode is setup (in python-internal.h) as an
>>> alias for PyUnicode_Decode, so there should certainly be no user
>>> visible changes in this case.
>>>
>>> For Python 2 this commit will change the behaviour.  Previously by
>>> calling PyUnicode_Decode we would have been returning a Unicode
>>> object.  Now, after calling host_string_to_python_string, we will have
>>> a str object.
>>>
>>> I've checked the GDB documentation, and, as far as I can tell, the
>>> methods I've touched were all documented as returning a string, or in
>>> the gdb.Command case, take a string as an argument, and my
>>> understanding is that for Python 2, string generally means str.  So I
>>> think the new behaviour would be more expected.
>>>
>>> A different solution, that would also make things more consistent in
>>> the Python 2 world, would be to have host_string_to_python_string
>>> always return a Unicode object.  However, I've been reading this page:
>>>
>>>   https://pythonhosted.org/kitchen/unicode-frustrations.html
>>>
>>> item #3 recommends that unicode strings should be converted to str
>>> objects before being printed (in Python 2).  That would mean that
>>> users should have been adding .encode() calls to the output of the
>>> routines I've changed in this commit (if they wanted to print that
>>> output), which is not something I think is made clear from the GDB
>>> documentation.
>>
>> I don't think we can (nor want) to change things currently returning
>> unicode objects to return str objects.  It changes significantly what
>> the user code receives.  If a given method/function has been returning a
>> unicode object since the dawn of time and we now return a string object,
>> user code expecting a unicode object will break.
> 
> ACK.  I think I disagree about the significance of the change, but I
> understand your concerns.

I fail to see how you don't find it significant: this is a public API,
we change the type of the returned value, doesn't it break existing
users?

>> Also, changing a PyUnicode_Decode call to host_string_to_python_string
>> (and therefore PyString_Decode) means that if the string contains
>> something not encodable in ascii, things will now fail  So that sounds
>> like a regression to me: where we could handle UTF-8 strings before, we
>> don't now.  See below for examples.
> 
> Yeah, you're absolutely correct.  I don't understand why we would want
> to call PyString_Decode here for Py2 at all, maybe you understand why
> that's a good thing?
> 
> If I rewrite the function like this:
> 
>   /* Convert a host string to a python string.  */
> 
>   gdbpy_ref<>
>   host_string_to_python_string (const char *str, size_t len)
>   {
>   #ifdef IS_PY3K
>     return gdbpy_ref<> (PyString_Decode (str, len, host_charset (), nullptr));
>   #else
>     return gdbpy_ref<> (PyString_FromStringAndSize(str, len));
>   #endif
>   }
> 
> Then I think this would achieve everything I actually intended this
> patch to achieve.

Can you clarify what the patch is trying to achieve?  I don't think I
understand properly.

Given that Python 2's string type is equivalent to Python 3's bytes
type, and Python 2's unicode type is equivalent to Python 3's string
type, I find it odd to have host_string_to_python_string (and some
Python API functions) return a "string" both in Python 2 and Python 3,
as they are both fundamentally different types.  Returning a unicode in
Python 2 and a string in Python 3 makes sense to me, they are basically
the same  thing.

> Obviously, this still doesn't address your concern
> about the unicode -> str change (for Py2), so I doubt you'd now find
> the patch acceptable.
> 
> That said, I suspect changing host_string_to_python_string as in the
> above would probably be a good thing, right?  The above function is
> used, so all I need to do is inject some non ASCII characters into a
> code path that calls the above function and the existing GDB will
> break, but the above change would allow things to work correctly.

Code paths that do use this function already get a "str" in both Python
2 and 3 (which I think is wrong, as explained above, but that's what we
have to deal with) and would still receive a "str" after, so the change
is is safe from that point of view.  Today, a non-ASCII character in the
input argument just causes an exception with Python 2.  With the change
you propose, it would return an "str" containing whatever bytes came in,
in the host system's encoding.  So, the new behavior seems like an
improvement at first sight.

>> So my suggestion is to leave things as-is, avoid changing the behavior
>> for Python 2 until we finally remove Python 2 support (for which anybody
>> willing to do it has my automatic +1).
> 
> I certainly wouldn't want to stop someone removing Py2 support.

I'll take a look at that, many people have suggested it, and this is an
example of how keeping Python 2 support is a burden.

>>> ---
>>>  gdb/python/py-cmd.c          |  9 +++------
>>>  gdb/python/py-frame.c        |  5 ++---
>>>  gdb/python/py-type.c         |  3 +--
>>>  gdb/python/py-utils.c        |  5 ++---
>>>  gdb/python/py-value.c        |  4 ++--
>>>  gdb/python/python-internal.h | 11 ++++++++++-
>>>  gdb/python/python.c          |  4 ++--
>>>  7 files changed, 22 insertions(+), 19 deletions(-)
>>>
>>> diff --git a/gdb/python/py-cmd.c b/gdb/python/py-cmd.c
>>> index 94608e0bbcf..b2cafeba320 100644
>>> --- a/gdb/python/py-cmd.c
>>> +++ b/gdb/python/py-cmd.c
>>> @@ -120,8 +120,7 @@ cmdpy_function (const char *args, int from_tty, cmd_list_element *command)
>>>  
>>>    if (! args)
>>>      args = "";
>>> -  gdbpy_ref<> argobj (PyUnicode_Decode (args, strlen (args), host_charset (),
>>> -					NULL));
>>> +  gdbpy_ref<> argobj = host_string_to_python_string (args);
>>>    if (argobj == NULL)
>>>      {
>>>        gdbpy_print_stack ();
>>
>> Try with this in "script.py", where a user correctly called encode (as
>> you mentioned in the commit log) before printing args, a 'unicode'
>> object:
>>
>> ---
>> class MyCmd(gdb.Command):
>>     def __init__(self):
>>         super(MyCmd, self).__init__("my-cmd", gdb.COMMAND_USER)
>>
>>     def invoke(self, args, from_tty):
>>         print(args.encode('utf-8'))
> 
> Really this should be:
> 
>   def invoke(self, args, from_tty):
>       print(args.encode(gdb.host_charset ()))
> 
> Except we don't have a gdb.host_charset method, otherwise it should be
> possible for this code to go wrong.

True.  Although somebody could still use .encode('utf-8') and just use
that script on machines where UTF-8 is the locale (which is just the
norm today).

> I'll write a patch to add that.
> 
> I assume you'll not object if I propose updating the documentation for
> all the functions I tried to change here to document the actual
> behaviour?

Sure.  Although if we end up removing Python 2 support (which is not a
given), it might be unnecessary.

Simon

next prev parent reply	other threads:[~2021-12-24  3:10 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-12-06 18:10 Andrew Burgess
2021-12-22 15:08 ` PING: " Andrew Burgess
2021-12-23  2:19 ` Simon Marchi
2021-12-23 11:07   ` Andrew Burgess
2021-12-24  3:10     ` Simon Marchi [this message]
2021-12-24 12:20       ` Andrew Burgess
2021-12-27  1:30         ` Simon Marchi
2021-12-23 11:35 ` Andrew Burgess

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=8623b32d-697d-9b71-cf07-f073a44de1aa@polymtl.ca \
    --to=simon.marchi@polymtl.ca \
    --cc=aburgess@redhat.com \
    --cc=gdb-patches@sourceware.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).