From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <aburgess@redhat.com>
Received: from us-smtp-delivery-124.mimecast.com
 (us-smtp-delivery-124.mimecast.com [170.10.133.124])
 by sourceware.org (Postfix) with ESMTPS id E10B93858416
 for <gdb-patches@sourceware.org>; Thu, 23 Dec 2021 11:07:10 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org E10B93858416
Received: from mail-wm1-f72.google.com (mail-wm1-f72.google.com
 [209.85.128.72]) by relay.mimecast.com with ESMTP with STARTTLS
 (version=TLSv1.2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id
 us-mta-624-X0DpKE1PNPKu_dzH0pqd3A-1; Thu, 23 Dec 2021 06:07:07 -0500
X-MC-Unique: X0DpKE1PNPKu_dzH0pqd3A-1
Received: by mail-wm1-f72.google.com with SMTP id
 a69-20020a1c7f48000000b00345d3d135caso699837wmd.7
 for <gdb-patches@sourceware.org>; Thu, 23 Dec 2021 03:07:07 -0800 (PST)
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20210112;
 h=x-gm-message-state:date:from:to:cc:subject:message-id:references
 :mime-version:content-disposition:content-transfer-encoding
 :in-reply-to;
 bh=SFHDv4jQw+MzYGm0dN5XSfb0GDBYNjOxyLei9xWl2gg=;
 b=39EVGs6JqRrKS/aKUDioj/aUbq3g9GxON66nnyls7STY2lh+E3i0n2WytufEn7wUFM
 9jCkp+n2LhQozZIXzl5MmxSbSgOI+D6Xj0sM7EIC5N1r4abC+77HJ6YfKjoVME//T0x7
 9OBog9dTCZ2Uhe2Zh8NGTMayK148yVpK/3jMlTFOR7peY+jYDjXs+mvsBhYTcpuKTA4c
 j00E1WFJB1DoNLqg5VVqJMrXbsfoFbb+GS0RjcELcu/6/l5RsA1rCpmJ9OrsG1V9zF2t
 NEqAkHtcrfp3SOOuX+Rk1RC6r2ylLdIGOldmXVZt8tNkY3E2d+3i129qBuzlmjv5o3YY
 hOJA==
X-Gm-Message-State: AOAM532Ixhs0RJIrMkMQvKTyVz/sz35u+MmpI0+v488R7ce2MNw1vs89
 /dR8lsRGr9PBOMzZA+kJiFLZ3kurjks9B5FoqmlQ1Wn268ZfXTaTIZcfgFAA/xYd+x4+LFuM5mL
 RvEhdwSdvgIWZTFYGUtUhfw==
X-Received: by 2002:a05:6000:1aca:: with SMTP id
 i10mr1369708wry.453.1640257626089; 
 Thu, 23 Dec 2021 03:07:06 -0800 (PST)
X-Google-Smtp-Source: ABdhPJxVpXcqOuemZqWlQkZSd6CzQutUaWbYX51EaQ9gjOvVm5FK6pRV5vMiPZteSlt4gOS3sVVDKQ==
X-Received: by 2002:a05:6000:1aca:: with SMTP id
 i10mr1369686wry.453.1640257625696; 
 Thu, 23 Dec 2021 03:07:05 -0800 (PST)
Received: from localhost (host86-134-238-138.range86-134.btcentralplus.com.
 [86.134.238.138])
 by smtp.gmail.com with ESMTPSA id b16sm4893852wmq.41.2021.12.23.03.07.04
 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
 Thu, 23 Dec 2021 03:07:05 -0800 (PST)
Date: Thu, 23 Dec 2021 11:07:03 +0000
From: Andrew Burgess <aburgess@redhat.com>
To: Simon Marchi <simon.marchi@polymtl.ca>
Cc: gdb-patches@sourceware.org
Subject: Re: [PATCH] gdb/python: make more use of host_string_to_python_string
Message-ID: <20211223110703.GI2634@redhat.com>
References: <20211206181030.2463266-1-aburgess@redhat.com>
 <08ccc2c2-7f0b-e962-f2d5-1cc1ce49530d@polymtl.ca>
MIME-Version: 1.0
In-Reply-To: <08ccc2c2-7f0b-e962-f2d5-1cc1ce49530d@polymtl.ca>
X-Operating-System: Linux/5.8.18-100.fc31.x86_64 (x86_64)
X-Uptime: 10:56:34 up 5 days, 44 min,
X-Editor: GNU Emacs [ http://www.gnu.org/software/emacs ]
X-Mimecast-Spam-Score: 0
X-Mimecast-Originator: redhat.com
Content-Type: text/plain; charset=iso-8859-1
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable
X-Spam-Status: No, score=-4.8 required=5.0 tests=BAYES_00, DKIMWL_WL_HIGH,
 DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, GIT_PATCH_0,
 RCVD_IN_BARRACUDACENTRAL, RCVD_IN_DNSWL_LOW, RCVD_IN_MSPIKE_H3,
 RCVD_IN_MSPIKE_WL, SPAM_URI, SPF_HELO_NONE, SPF_NONE,
 TXREP autolearn=ham autolearn_force=no version=3.4.4
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on
 server2.sourceware.org
X-BeenThere: gdb-patches@sourceware.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Gdb-patches mailing list <gdb-patches.sourceware.org>
List-Unsubscribe: <https://sourceware.org/mailman/options/gdb-patches>,
 <mailto:gdb-patches-request@sourceware.org?subject=unsubscribe>
List-Archive: <https://sourceware.org/pipermail/gdb-patches/>
List-Post: <mailto:gdb-patches@sourceware.org>
List-Help: <mailto:gdb-patches-request@sourceware.org?subject=help>
List-Subscribe: <https://sourceware.org/mailman/listinfo/gdb-patches>,
 <mailto:gdb-patches-request@sourceware.org?subject=subscribe>
X-List-Received-Date: Thu, 23 Dec 2021 11:07:13 -0000

* Simon Marchi <simon.marchi@polymtl.ca> [2021-12-22 21:19:41 -0500]:

>=20
>=20
> On 2021-12-06 13:10, Andrew Burgess via Gdb-patches wrote:
> > We have a function host_string_to_python_string, which is a wrapper
> > around PyString_Decode, which, on Python 3, is an alias for
> > PyUnicode_Decode.
> >=20
> > However, there are a few places where we still call PyUnicode_Decode
> > directly.
> >=20
> > This commit replaces all uses of PyUnicode_Decode with calls to
> > host_string_to_python_string instead.
> >=20
> > To make the use of host_string_to_python_string easier, I've added a
> > couple of overloads for this function in python-internal.h, these all
> > just forward their calls onto the single base implementation.  The
> > signatures of all host_string_to_python_string overloads are:
> >=20
> >   gdbpy_ref<> host_string_to_python_string (const char *str, size_t len=
);
> >   gdbpy_ref<> host_string_to_python_string (const char *str)
> >   gdbpy_ref<> host_string_to_python_string (const string_file &str)
> >=20
> > For Python 3 PyString_Decode is setup (in python-internal.h) as an
> > alias for PyUnicode_Decode, so there should certainly be no user
> > visible changes in this case.
> >=20
> > For Python 2 this commit will change the behaviour.  Previously by
> > calling PyUnicode_Decode we would have been returning a Unicode
> > object.  Now, after calling host_string_to_python_string, we will have
> > a str object.
> >=20
> > I've checked the GDB documentation, and, as far as I can tell, the
> > methods I've touched were all documented as returning a string, or in
> > the gdb.Command case, take a string as an argument, and my
> > understanding is that for Python 2, string generally means str.  So I
> > think the new behaviour would be more expected.
> >=20
> > A different solution, that would also make things more consistent in
> > the Python 2 world, would be to have host_string_to_python_string
> > always return a Unicode object.  However, I've been reading this page:
> >=20
> >   https://pythonhosted.org/kitchen/unicode-frustrations.html
> >=20
> > item #3 recommends that unicode strings should be converted to str
> > objects before being printed (in Python 2).  That would mean that
> > users should have been adding .encode() calls to the output of the
> > routines I've changed in this commit (if they wanted to print that
> > output), which is not something I think is made clear from the GDB
> > documentation.
>=20
> I don't think we can (nor want) to change things currently returning
> unicode objects to return str objects.  It changes significantly what
> the user code receives.  If a given method/function has been returning a
> unicode object since the dawn of time and we now return a string object,
> user code expecting a unicode object will break.

ACK.  I think I disagree about the significance of the change, but I
understand your concerns.

>=20
> Also, changing a PyUnicode_Decode call to host_string_to_python_string
> (and therefore PyString_Decode) means that if the string contains
> something not encodable in ascii, things will now fail  So that sounds
> like a regression to me: where we could handle UTF-8 strings before, we
> don't now.  See below for examples.

Yeah, you're absolutely correct.  I don't understand why we would want
to call PyString_Decode here for Py2 at all, maybe you understand why
that's a good thing?

If I rewrite the function like this:

  /* Convert a host string to a python string.  */

  gdbpy_ref<>
  host_string_to_python_string (const char *str, size_t len)
  {
  #ifdef IS_PY3K
    return gdbpy_ref<> (PyString_Decode (str, len, host_charset (), nullptr=
));
  #else
    return gdbpy_ref<> (PyString_FromStringAndSize(str, len));
  #endif
  }

Then I think this would achieve everything I actually intended this
patch to achieve.  Obviously, this still doesn't address your concern
about the unicode -> str change (for Py2), so I doubt you'd now find
the patch acceptable.

That said, I suspect changing host_string_to_python_string as in the
above would probably be a good thing, right?  The above function is
used, so all I need to do is inject some non ASCII characters into a
code path that calls the above function and the existing GDB will
break, but the above change would allow things to work correctly.

>=20
> So my suggestion is to leave things as-is, avoid changing the behavior
> for Python 2 until we finally remove Python 2 support (for which anybody
> willing to do it has my automatic +1).

I certainly wouldn't want to stop someone removing Py2 support.

>=20
> > ---
> >  gdb/python/py-cmd.c          |  9 +++------
> >  gdb/python/py-frame.c        |  5 ++---
> >  gdb/python/py-type.c         |  3 +--
> >  gdb/python/py-utils.c        |  5 ++---
> >  gdb/python/py-value.c        |  4 ++--
> >  gdb/python/python-internal.h | 11 ++++++++++-
> >  gdb/python/python.c          |  4 ++--
> >  7 files changed, 22 insertions(+), 19 deletions(-)
> >=20
> > diff --git a/gdb/python/py-cmd.c b/gdb/python/py-cmd.c
> > index 94608e0bbcf..b2cafeba320 100644
> > --- a/gdb/python/py-cmd.c
> > +++ b/gdb/python/py-cmd.c
> > @@ -120,8 +120,7 @@ cmdpy_function (const char *args, int from_tty, cmd=
_list_element *command)
> > =20
> >    if (! args)
> >      args =3D "";
> > -  gdbpy_ref<> argobj (PyUnicode_Decode (args, strlen (args), host_char=
set (),
> > -=09=09=09=09=09NULL));
> > +  gdbpy_ref<> argobj =3D host_string_to_python_string (args);
> >    if (argobj =3D=3D NULL)
> >      {
> >        gdbpy_print_stack ();
>=20
> Try with this in "script.py", where a user correctly called encode (as
> you mentioned in the commit log) before printing args, a 'unicode'
> object:
>=20
> ---
> class MyCmd(gdb.Command):
>     def __init__(self):
>         super(MyCmd, self).__init__("my-cmd", gdb.COMMAND_USER)
>=20
>     def invoke(self, args, from_tty):
>         print(args.encode('utf-8'))

Really this should be:

  def invoke(self, args, from_tty):
      print(args.encode(gdb.host_charset ()))

Except we don't have a gdb.host_charset method, otherwise it should be
possible for this code to go wrong.  I'll write a patch to add that.

I assume you'll not object if I propose updating the documentation for
all the functions I tried to change here to document the actual
behaviour?

Thanks for your in depth analysis.

Andrew

>=20
> MyCmd()
> ---
>=20
> $ ./gdb-old -q -nx --data-directory=3Ddata-directory -x script.py -ex 'my=
-cmd l=E9l' -batch
> l=E9l
> $ ./gdb-new -q -nx --data-directory=3Ddata-directory -x script.py -ex 'my=
-cmd l=E9l' -batch
> Python Exception <type 'exceptions.UnicodeEncodeError'>: 'ascii' codec ca=
n't encode character u'\xe9' in position 1: ordinal not in range(128)
> Could not convert arguments to Python string.
>=20
> > @@ -181,8 +180,7 @@ cmdpy_completer_helper (struct cmd_list_element *co=
mmand,
> >        return NULL;
> >      }
> > =20
> > -  gdbpy_ref<> textobj (PyUnicode_Decode (text, strlen (text), host_cha=
rset (),
> > -=09=09=09=09=09 NULL));
> > +  gdbpy_ref<> textobj =3D host_string_to_python_string (text);
> >    if (textobj =3D=3D NULL)
> >      error (_("Could not convert argument to Python string."));
> > =20
> > @@ -194,8 +192,7 @@ cmdpy_completer_helper (struct cmd_list_element *co=
mmand,
> >      }
> >    else
> >      {
> > -      wordobj.reset (PyUnicode_Decode (word, strlen (word), host_chars=
et (),
> > -=09=09=09=09       NULL));
> > +      wordobj =3D host_string_to_python_string (word);
> >        if (wordobj =3D=3D NULL)
> >  =09error (_("Could not convert argument to Python string."));
> >      }
> > diff --git a/gdb/python/py-frame.c b/gdb/python/py-frame.c
> > index ee57eb10576..b507ff0794f 100644
> > --- a/gdb/python/py-frame.c
> > +++ b/gdb/python/py-frame.c
> > @@ -131,8 +131,7 @@ frapy_name (PyObject *self, PyObject *args)
> > =20
> >    if (name)
> >      {
> > -      result =3D PyUnicode_Decode (name.get (), strlen (name.get ()),
> > -=09=09=09=09 host_charset (), NULL);
> > +      result =3D host_string_to_python_string (name.get ()).release ()=
;
>=20
> Same here, you can have a frame with a non-ascii name, that would now
> break.
>=20
> >      }
> >    else
> >      {
> > @@ -658,7 +657,7 @@ gdbpy_frame_stop_reason_string (PyObject *self, PyO=
bject *args)
> >      }
> > =20
> >    str =3D unwind_stop_reason_to_string ((enum unwind_stop_reason) reas=
on);
> > -  return PyUnicode_Decode (str, strlen (str), host_charset (), NULL);
> > +  return host_string_to_python_string (str).release ();
>=20
> Stop reasons in english probably don't contain non-ascii characters, but
> they are translated, so could, in theory, contain non-ascii characters.
> So that would break here.
>=20
>=20
> =09$ ./gdb-old -nx -q --data-directory=3Ddata-directory a.out -ex start -=
ex s -ex 'python print(gdb.selected_frame().name().encode("utf-8"))'
> =09Reading symbols from a.out...
> =09Temporary breakpoint 1 at 0x1124: file test.cpp, line 7.
> =09Starting program: /home/simark/build/binutils-gdb-python2/gdb/a.out=20
>=20
> =09Temporary breakpoint 1, main () at test.cpp:7
> =097         l=E9l();
> =09l=E9l () at test.cpp:3
> =093       }
> =09l=E9l
> =09(gdb)=20
>=20
>=20
> =09$ ./gdb-new -nx -q --data-directory=3Ddata-directory a.out -ex start -=
ex s -ex 'python print(gdb.selected_frame().name().encode("utf-8"))'
> =09Reading symbols from a.out...
> =09Temporary breakpoint 1 at 0x1124: file test.cpp, line 7.
> =09Starting program: /home/simark/build/binutils-gdb-python2/gdb/a.out=20
>=20
> =09Temporary breakpoint 1, main () at test.cpp:7
> =097         l=E9l();
> =09l=E9l () at test.cpp:3
> =093       }
> =09Traceback (most recent call last):
> =09  File "<string>", line 1, in <module>
> =09UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in po=
sition 1: ordinal not in range(128)
> =09Error while executing Python code.
>=20
> >  }
> > =20
> >  /* Implements the equality comparison for Frame objects.
> > diff --git a/gdb/python/py-type.c b/gdb/python/py-type.c
> > index 8b17b70fbe3..a178c6a4ab2 100644
> > --- a/gdb/python/py-type.c
> > +++ b/gdb/python/py-type.c
> > @@ -1033,8 +1033,7 @@ typy_str (PyObject *self)
> >        GDB_PY_HANDLE_EXCEPTION (except);
> >      }
> > =20
> > -  return PyUnicode_Decode (thetype.c_str (), thetype.size (),
> > -=09=09=09   host_charset (), NULL);
> > +  return host_string_to_python_string (thetype).release ();
>=20
> Here, if the type name contains some non ascii characters, old GDB
> would still break, as Python will try to call str() on the unicode
> object, which (I guess) will try to encode it as ascii.  With new GDB,
> the same will happen, but earlier, inside host_string_to_python_string.
> If somebody called Type.__str__ directly, they would see the difference
> though.
>=20
> >  }
> > =20
> >  /* Implement the richcompare method.  */
> > diff --git a/gdb/python/py-utils.c b/gdb/python/py-utils.c
> > index 10c4173efcd..2eb1ed2a09e 100644
> > --- a/gdb/python/py-utils.c
> > +++ b/gdb/python/py-utils.c
> > @@ -152,10 +152,9 @@ python_string_to_host_string (PyObject *obj)
> >  /* Convert a host string to a python string.  */
> > =20
> >  gdbpy_ref<>
> > -host_string_to_python_string (const char *str)
> > +host_string_to_python_string (const char *str, size_t len)
> >  {
> > -  return gdbpy_ref<> (PyString_Decode (str, strlen (str), host_charset=
 (),
> > -=09=09=09=09       NULL));
> > +  return gdbpy_ref<> (PyString_Decode (str, len, host_charset (), null=
ptr));
> >  }
> > =20
> >  /* Return true if OBJ is a Python string or unicode object, false
> > diff --git a/gdb/python/py-value.c b/gdb/python/py-value.c
> > index c843c2c3072..8bd30729454 100644
> > --- a/gdb/python/py-value.c
> > +++ b/gdb/python/py-value.c
> > @@ -743,7 +743,7 @@ valpy_format_string (PyObject *self, PyObject *args=
, PyObject *kw)
> >        GDB_PY_HANDLE_EXCEPTION (except);
> >      }
> > =20
> > -  return PyUnicode_Decode (stb.c_str (), stb.size (), host_charset (),=
 NULL);
> > +  return host_string_to_python_string (stb).release ();
>=20
> I guess that a value here could easily be a string with non-ascii
> characters.  Also, note what the comment of the function says:
>=20
>   /* Implementation of gdb.Value.format_string (...) -> string.
>      Return Unicode string with value contents formatted using the
>      keyword-only arguments.  */
>=20
>=20
> Simon
>=20