From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) by sourceware.org (Postfix) with ESMTPS id E10B93858416 for ; Thu, 23 Dec 2021 11:07:10 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org E10B93858416 Received: from mail-wm1-f72.google.com (mail-wm1-f72.google.com [209.85.128.72]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id us-mta-624-X0DpKE1PNPKu_dzH0pqd3A-1; Thu, 23 Dec 2021 06:07:07 -0500 X-MC-Unique: X0DpKE1PNPKu_dzH0pqd3A-1 Received: by mail-wm1-f72.google.com with SMTP id a69-20020a1c7f48000000b00345d3d135caso699837wmd.7 for ; Thu, 23 Dec 2021 03:07:07 -0800 (PST) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:date:from:to:cc:subject:message-id:references :mime-version:content-disposition:content-transfer-encoding :in-reply-to; bh=SFHDv4jQw+MzYGm0dN5XSfb0GDBYNjOxyLei9xWl2gg=; b=39EVGs6JqRrKS/aKUDioj/aUbq3g9GxON66nnyls7STY2lh+E3i0n2WytufEn7wUFM 9jCkp+n2LhQozZIXzl5MmxSbSgOI+D6Xj0sM7EIC5N1r4abC+77HJ6YfKjoVME//T0x7 9OBog9dTCZ2Uhe2Zh8NGTMayK148yVpK/3jMlTFOR7peY+jYDjXs+mvsBhYTcpuKTA4c j00E1WFJB1DoNLqg5VVqJMrXbsfoFbb+GS0RjcELcu/6/l5RsA1rCpmJ9OrsG1V9zF2t NEqAkHtcrfp3SOOuX+Rk1RC6r2ylLdIGOldmXVZt8tNkY3E2d+3i129qBuzlmjv5o3YY hOJA== X-Gm-Message-State: AOAM532Ixhs0RJIrMkMQvKTyVz/sz35u+MmpI0+v488R7ce2MNw1vs89 /dR8lsRGr9PBOMzZA+kJiFLZ3kurjks9B5FoqmlQ1Wn268ZfXTaTIZcfgFAA/xYd+x4+LFuM5mL RvEhdwSdvgIWZTFYGUtUhfw== X-Received: by 2002:a05:6000:1aca:: with SMTP id i10mr1369708wry.453.1640257626089; Thu, 23 Dec 2021 03:07:06 -0800 (PST) X-Google-Smtp-Source: ABdhPJxVpXcqOuemZqWlQkZSd6CzQutUaWbYX51EaQ9gjOvVm5FK6pRV5vMiPZteSlt4gOS3sVVDKQ== X-Received: by 2002:a05:6000:1aca:: with SMTP id i10mr1369686wry.453.1640257625696; Thu, 23 Dec 2021 03:07:05 -0800 (PST) Received: from localhost (host86-134-238-138.range86-134.btcentralplus.com. [86.134.238.138]) by smtp.gmail.com with ESMTPSA id b16sm4893852wmq.41.2021.12.23.03.07.04 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 23 Dec 2021 03:07:05 -0800 (PST) Date: Thu, 23 Dec 2021 11:07:03 +0000 From: Andrew Burgess To: Simon Marchi Cc: gdb-patches@sourceware.org Subject: Re: [PATCH] gdb/python: make more use of host_string_to_python_string Message-ID: <20211223110703.GI2634@redhat.com> References: <20211206181030.2463266-1-aburgess@redhat.com> <08ccc2c2-7f0b-e962-f2d5-1cc1ce49530d@polymtl.ca> MIME-Version: 1.0 In-Reply-To: <08ccc2c2-7f0b-e962-f2d5-1cc1ce49530d@polymtl.ca> X-Operating-System: Linux/5.8.18-100.fc31.x86_64 (x86_64) X-Uptime: 10:56:34 up 5 days, 44 min, X-Editor: GNU Emacs [ http://www.gnu.org/software/emacs ] X-Mimecast-Spam-Score: 0 X-Mimecast-Originator: redhat.com Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline Content-Transfer-Encoding: quoted-printable X-Spam-Status: No, score=-4.8 required=5.0 tests=BAYES_00, DKIMWL_WL_HIGH, DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, GIT_PATCH_0, RCVD_IN_BARRACUDACENTRAL, RCVD_IN_DNSWL_LOW, RCVD_IN_MSPIKE_H3, RCVD_IN_MSPIKE_WL, SPAM_URI, SPF_HELO_NONE, SPF_NONE, TXREP autolearn=ham autolearn_force=no version=3.4.4 X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on server2.sourceware.org X-BeenThere: gdb-patches@sourceware.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Gdb-patches mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 23 Dec 2021 11:07:13 -0000 * Simon Marchi [2021-12-22 21:19:41 -0500]: >=20 >=20 > On 2021-12-06 13:10, Andrew Burgess via Gdb-patches wrote: > > We have a function host_string_to_python_string, which is a wrapper > > around PyString_Decode, which, on Python 3, is an alias for > > PyUnicode_Decode. > >=20 > > However, there are a few places where we still call PyUnicode_Decode > > directly. > >=20 > > This commit replaces all uses of PyUnicode_Decode with calls to > > host_string_to_python_string instead. > >=20 > > To make the use of host_string_to_python_string easier, I've added a > > couple of overloads for this function in python-internal.h, these all > > just forward their calls onto the single base implementation. The > > signatures of all host_string_to_python_string overloads are: > >=20 > > gdbpy_ref<> host_string_to_python_string (const char *str, size_t len= ); > > gdbpy_ref<> host_string_to_python_string (const char *str) > > gdbpy_ref<> host_string_to_python_string (const string_file &str) > >=20 > > For Python 3 PyString_Decode is setup (in python-internal.h) as an > > alias for PyUnicode_Decode, so there should certainly be no user > > visible changes in this case. > >=20 > > For Python 2 this commit will change the behaviour. Previously by > > calling PyUnicode_Decode we would have been returning a Unicode > > object. Now, after calling host_string_to_python_string, we will have > > a str object. > >=20 > > I've checked the GDB documentation, and, as far as I can tell, the > > methods I've touched were all documented as returning a string, or in > > the gdb.Command case, take a string as an argument, and my > > understanding is that for Python 2, string generally means str. So I > > think the new behaviour would be more expected. > >=20 > > A different solution, that would also make things more consistent in > > the Python 2 world, would be to have host_string_to_python_string > > always return a Unicode object. However, I've been reading this page: > >=20 > > https://pythonhosted.org/kitchen/unicode-frustrations.html > >=20 > > item #3 recommends that unicode strings should be converted to str > > objects before being printed (in Python 2). That would mean that > > users should have been adding .encode() calls to the output of the > > routines I've changed in this commit (if they wanted to print that > > output), which is not something I think is made clear from the GDB > > documentation. >=20 > I don't think we can (nor want) to change things currently returning > unicode objects to return str objects. It changes significantly what > the user code receives. If a given method/function has been returning a > unicode object since the dawn of time and we now return a string object, > user code expecting a unicode object will break. ACK. I think I disagree about the significance of the change, but I understand your concerns. >=20 > Also, changing a PyUnicode_Decode call to host_string_to_python_string > (and therefore PyString_Decode) means that if the string contains > something not encodable in ascii, things will now fail So that sounds > like a regression to me: where we could handle UTF-8 strings before, we > don't now. See below for examples. Yeah, you're absolutely correct. I don't understand why we would want to call PyString_Decode here for Py2 at all, maybe you understand why that's a good thing? If I rewrite the function like this: /* Convert a host string to a python string. */ gdbpy_ref<> host_string_to_python_string (const char *str, size_t len) { #ifdef IS_PY3K return gdbpy_ref<> (PyString_Decode (str, len, host_charset (), nullptr= )); #else return gdbpy_ref<> (PyString_FromStringAndSize(str, len)); #endif } Then I think this would achieve everything I actually intended this patch to achieve. Obviously, this still doesn't address your concern about the unicode -> str change (for Py2), so I doubt you'd now find the patch acceptable. That said, I suspect changing host_string_to_python_string as in the above would probably be a good thing, right? The above function is used, so all I need to do is inject some non ASCII characters into a code path that calls the above function and the existing GDB will break, but the above change would allow things to work correctly. >=20 > So my suggestion is to leave things as-is, avoid changing the behavior > for Python 2 until we finally remove Python 2 support (for which anybody > willing to do it has my automatic +1). I certainly wouldn't want to stop someone removing Py2 support. >=20 > > --- > > gdb/python/py-cmd.c | 9 +++------ > > gdb/python/py-frame.c | 5 ++--- > > gdb/python/py-type.c | 3 +-- > > gdb/python/py-utils.c | 5 ++--- > > gdb/python/py-value.c | 4 ++-- > > gdb/python/python-internal.h | 11 ++++++++++- > > gdb/python/python.c | 4 ++-- > > 7 files changed, 22 insertions(+), 19 deletions(-) > >=20 > > diff --git a/gdb/python/py-cmd.c b/gdb/python/py-cmd.c > > index 94608e0bbcf..b2cafeba320 100644 > > --- a/gdb/python/py-cmd.c > > +++ b/gdb/python/py-cmd.c > > @@ -120,8 +120,7 @@ cmdpy_function (const char *args, int from_tty, cmd= _list_element *command) > > =20 > > if (! args) > > args =3D ""; > > - gdbpy_ref<> argobj (PyUnicode_Decode (args, strlen (args), host_char= set (), > > -=09=09=09=09=09NULL)); > > + gdbpy_ref<> argobj =3D host_string_to_python_string (args); > > if (argobj =3D=3D NULL) > > { > > gdbpy_print_stack (); >=20 > Try with this in "script.py", where a user correctly called encode (as > you mentioned in the commit log) before printing args, a 'unicode' > object: >=20 > --- > class MyCmd(gdb.Command): > def __init__(self): > super(MyCmd, self).__init__("my-cmd", gdb.COMMAND_USER) >=20 > def invoke(self, args, from_tty): > print(args.encode('utf-8')) Really this should be: def invoke(self, args, from_tty): print(args.encode(gdb.host_charset ())) Except we don't have a gdb.host_charset method, otherwise it should be possible for this code to go wrong. I'll write a patch to add that. I assume you'll not object if I propose updating the documentation for all the functions I tried to change here to document the actual behaviour? Thanks for your in depth analysis. Andrew >=20 > MyCmd() > --- >=20 > $ ./gdb-old -q -nx --data-directory=3Ddata-directory -x script.py -ex 'my= -cmd l=E9l' -batch > l=E9l > $ ./gdb-new -q -nx --data-directory=3Ddata-directory -x script.py -ex 'my= -cmd l=E9l' -batch > Python Exception : 'ascii' codec ca= n't encode character u'\xe9' in position 1: ordinal not in range(128) > Could not convert arguments to Python string. >=20 > > @@ -181,8 +180,7 @@ cmdpy_completer_helper (struct cmd_list_element *co= mmand, > > return NULL; > > } > > =20 > > - gdbpy_ref<> textobj (PyUnicode_Decode (text, strlen (text), host_cha= rset (), > > -=09=09=09=09=09 NULL)); > > + gdbpy_ref<> textobj =3D host_string_to_python_string (text); > > if (textobj =3D=3D NULL) > > error (_("Could not convert argument to Python string.")); > > =20 > > @@ -194,8 +192,7 @@ cmdpy_completer_helper (struct cmd_list_element *co= mmand, > > } > > else > > { > > - wordobj.reset (PyUnicode_Decode (word, strlen (word), host_chars= et (), > > -=09=09=09=09 NULL)); > > + wordobj =3D host_string_to_python_string (word); > > if (wordobj =3D=3D NULL) > > =09error (_("Could not convert argument to Python string.")); > > } > > diff --git a/gdb/python/py-frame.c b/gdb/python/py-frame.c > > index ee57eb10576..b507ff0794f 100644 > > --- a/gdb/python/py-frame.c > > +++ b/gdb/python/py-frame.c > > @@ -131,8 +131,7 @@ frapy_name (PyObject *self, PyObject *args) > > =20 > > if (name) > > { > > - result =3D PyUnicode_Decode (name.get (), strlen (name.get ()), > > -=09=09=09=09 host_charset (), NULL); > > + result =3D host_string_to_python_string (name.get ()).release ()= ; >=20 > Same here, you can have a frame with a non-ascii name, that would now > break. >=20 > > } > > else > > { > > @@ -658,7 +657,7 @@ gdbpy_frame_stop_reason_string (PyObject *self, PyO= bject *args) > > } > > =20 > > str =3D unwind_stop_reason_to_string ((enum unwind_stop_reason) reas= on); > > - return PyUnicode_Decode (str, strlen (str), host_charset (), NULL); > > + return host_string_to_python_string (str).release (); >=20 > Stop reasons in english probably don't contain non-ascii characters, but > they are translated, so could, in theory, contain non-ascii characters. > So that would break here. >=20 >=20 > =09$ ./gdb-old -nx -q --data-directory=3Ddata-directory a.out -ex start -= ex s -ex 'python print(gdb.selected_frame().name().encode("utf-8"))' > =09Reading symbols from a.out... > =09Temporary breakpoint 1 at 0x1124: file test.cpp, line 7. > =09Starting program: /home/simark/build/binutils-gdb-python2/gdb/a.out=20 >=20 > =09Temporary breakpoint 1, main () at test.cpp:7 > =097 l=E9l(); > =09l=E9l () at test.cpp:3 > =093 } > =09l=E9l > =09(gdb)=20 >=20 >=20 > =09$ ./gdb-new -nx -q --data-directory=3Ddata-directory a.out -ex start -= ex s -ex 'python print(gdb.selected_frame().name().encode("utf-8"))' > =09Reading symbols from a.out... > =09Temporary breakpoint 1 at 0x1124: file test.cpp, line 7. > =09Starting program: /home/simark/build/binutils-gdb-python2/gdb/a.out=20 >=20 > =09Temporary breakpoint 1, main () at test.cpp:7 > =097 l=E9l(); > =09l=E9l () at test.cpp:3 > =093 } > =09Traceback (most recent call last): > =09 File "", line 1, in > =09UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in po= sition 1: ordinal not in range(128) > =09Error while executing Python code. >=20 > > } > > =20 > > /* Implements the equality comparison for Frame objects. > > diff --git a/gdb/python/py-type.c b/gdb/python/py-type.c > > index 8b17b70fbe3..a178c6a4ab2 100644 > > --- a/gdb/python/py-type.c > > +++ b/gdb/python/py-type.c > > @@ -1033,8 +1033,7 @@ typy_str (PyObject *self) > > GDB_PY_HANDLE_EXCEPTION (except); > > } > > =20 > > - return PyUnicode_Decode (thetype.c_str (), thetype.size (), > > -=09=09=09 host_charset (), NULL); > > + return host_string_to_python_string (thetype).release (); >=20 > Here, if the type name contains some non ascii characters, old GDB > would still break, as Python will try to call str() on the unicode > object, which (I guess) will try to encode it as ascii. With new GDB, > the same will happen, but earlier, inside host_string_to_python_string. > If somebody called Type.__str__ directly, they would see the difference > though. >=20 > > } > > =20 > > /* Implement the richcompare method. */ > > diff --git a/gdb/python/py-utils.c b/gdb/python/py-utils.c > > index 10c4173efcd..2eb1ed2a09e 100644 > > --- a/gdb/python/py-utils.c > > +++ b/gdb/python/py-utils.c > > @@ -152,10 +152,9 @@ python_string_to_host_string (PyObject *obj) > > /* Convert a host string to a python string. */ > > =20 > > gdbpy_ref<> > > -host_string_to_python_string (const char *str) > > +host_string_to_python_string (const char *str, size_t len) > > { > > - return gdbpy_ref<> (PyString_Decode (str, strlen (str), host_charset= (), > > -=09=09=09=09 NULL)); > > + return gdbpy_ref<> (PyString_Decode (str, len, host_charset (), null= ptr)); > > } > > =20 > > /* Return true if OBJ is a Python string or unicode object, false > > diff --git a/gdb/python/py-value.c b/gdb/python/py-value.c > > index c843c2c3072..8bd30729454 100644 > > --- a/gdb/python/py-value.c > > +++ b/gdb/python/py-value.c > > @@ -743,7 +743,7 @@ valpy_format_string (PyObject *self, PyObject *args= , PyObject *kw) > > GDB_PY_HANDLE_EXCEPTION (except); > > } > > =20 > > - return PyUnicode_Decode (stb.c_str (), stb.size (), host_charset (),= NULL); > > + return host_string_to_python_string (stb).release (); >=20 > I guess that a value here could easily be a string with non-ascii > characters. Also, note what the comment of the function says: >=20 > /* Implementation of gdb.Value.format_string (...) -> string. > Return Unicode string with value contents formatted using the > keyword-only arguments. */ >=20 >=20 > Simon >=20