public inbox for gdb-prs@sourceware.org
* [Bug python/17138] New: C strings, gdb.Value.__str__, and Python 3
@ 2014-07-10  2:47 naesten at gmail dot com
  2014-07-10  4:59 ` [Bug python/17138] " b.r.longbons at gmail dot com
  2023-09-13 14:47 ` tromey at sourceware dot org
  0 siblings, 2 replies; 3+ messages in thread
From: naesten at gmail dot com @ 2014-07-10  2:47 UTC (permalink / raw)
  To: gdb-prs

https://sourceware.org/bugzilla/show_bug.cgi?id=17138

            Bug ID: 17138
           Summary: C strings, gdb.Value.__str__, and Python 3
           Product: gdb
           Version: 7.7
            Status: NEW
          Severity: normal
          Priority: P2
         Component: python
          Assignee: unassigned at sourceware dot org
          Reporter: naesten at gmail dot com

I wanted to see how GDB's Python support dealt with strange C strings in Python
3 (using a build of 7.7.1 based on the Debian packaging git).

Now, I should probably start by reminding everyone that Python 3 has changed
the rules for strings: where in Python 2 the "" syntax and the corresponding
function-like-class, str(), implicitly refer to byte strings of no particular
encoding, in Python 3 they refer to Unicode strings, though code units can
nevertheless be 1, 2, or 4 bytes long.  Python 3 (and 2.6+) have a new b""
syntax and bytes() type for strings of bytes (which while they might sometimes
resemble text, should never be confused with actual text, unless of course they
actually do represent text, in which case they should be decoded).
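
As a minimal illustration of the distinction described above (plain Python 3,
nothing GDB-specific; the sample text is only an assumption for the demo):

```python
# Python 3: str holds Unicode code points, bytes holds raw 8-bit values.
text = "caf\u00e9"            # 4 code points
raw = text.encode("utf-8")    # 5 bytes: the e-acute takes two bytes in UTF-8

text_len = len(text)
raw_len = len(raw)

# Bytes that really do represent text must be decoded with a known encoding.
round_trip = raw.decode("utf-8")

# Indexing differs too: str yields a 1-character str, bytes yields an int.
first_char = text[0]
first_byte = raw[0]

print(text_len, raw_len, round_trip == text, repr(first_char), first_byte)
```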

Note: Probably all of the str() calls in the following are technically
redundant with the use of print(), but for clarity I will include them anyway. 
The parentheses around the argument to print are mandatory in Python 3, as the
print keyword has been replaced by a builtin function.

The first thing I tried had what looked like VERY strange results:

(gdb) python print(str(gdb.parse_and_eval('"foo\x80"')))
"foo\302\200"
(gdb) 

... until I realized that the escape was presumably being handled by Python
here, and so was treated as referring to U+0080, so GDB just encoded it as
UTF-8 before trying to parse it, with the obvious results.
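
That explanation can be checked in plain Python 3. This is only a sketch of
the presumed mechanism (it assumes GDB encodes the expression as UTF-8, as
guessed above; it does not call GDB at all):

```python
# Python processes the \x80 escape first, so the expression handed to GDB
# contains the single code point U+0080, not a backslash sequence.
expr = '"foo\x80"'
assert len(expr) == 6  # quote, f, o, o, U+0080, quote

# Encoding U+0080 as UTF-8 yields the two bytes 0xC2 0x80.
encoded = expr.encode("utf-8")

# Render the non-ASCII bytes the way GDB prints them, as octal escapes.
octal_escapes = "".join("\\%03o" % b for b in encoded if b >= 0x80)

print(encoded)
print(octal_escapes)
```

The octal escapes come out as \302\200, matching the GDB output above.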

So next I tried:

(gdb) python print(str(gdb.parse_and_eval('"foo\\x80"')))
"foo\200"
(gdb) 

... which looks like GDB just invented UCS-1.
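
For comparison, a small plain-Python sketch of what the doubled backslash
changes. The Latin-1 step is only an analogy for the "UCS-1" remark (one byte
mapped to one code point); whether GDB does anything like that internally is
an assumption, not something established here:

```python
# With the backslash doubled, Python passes the escape through untouched,
# so GDB itself gets to parse \x80 inside the C string literal.
expr = '"foo\\x80"'
assert expr == r'"foo\x80"'
assert len(expr) == 9  # quote, f, o, o, backslash, x, 8, 0, quote

# Byte 0x80 written as a C-style octal escape, as in GDB's output.
octal_form = "\\%03o" % 0x80

# Latin-1 maps every byte value directly to the code point of the same number.
as_latin1 = bytes([0x80]).decode("latin-1")

print(octal_form, hex(ord(as_latin1)))
```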

I also tried calling functions like len() and bytes() on these char* values,
only to find that they were not implemented.

Around this point, I decided to consult the documentation, which I discovered
did not mention the __str__() method *anywhere*, but did talk of a string()
method, so I tried that out instead:

(gdb) python print(gdb.parse_and_eval('"foo\\x80"').string())
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 3: invalid
start byte
Error while executing Python code.
(gdb) 

At last, results that I can actually understand!  (Unhelpful though they may
be.)
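
The same failure can be reproduced without GDB, which suggests string() is
simply doing a strict UTF-8 decode of the target bytes. A minimal sketch using
only the standard bytes.decode() (the GDB manual also describes encoding and
errors arguments to Value.string(), which are not exercised here):

```python
raw = b"foo\x80"  # the bytes of the C string, without the trailing NUL

# Strict decoding fails on the lone 0x80 byte, just like the traceback above.
try:
    raw.decode("utf-8")
    strict_error = None
except UnicodeDecodeError as exc:
    strict_error = exc

# Other error handlers trade the exception for lossy or escaped output.
replaced = raw.decode("utf-8", errors="replace")
escaped = raw.decode("utf-8", errors="backslashreplace")  # Python 3.5+

print(strict_error)
print(repr(replaced), repr(escaped))
```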

Are we sure this is the right default here?  Might it not make more sense to
return bytes unless specifically asked for an encoding?

At the very least, we should definitely provide a way to get uninterpreted
bytes in a bytes() object for Python 2.6+.

-- 
You are receiving this mail because:
You are on the CC list for the bug.



* [Bug python/17138] C strings, gdb.Value.__str__, and Python 3
  2014-07-10  2:47 [Bug python/17138] New: C strings, gdb.Value.__str__, and Python 3 naesten at gmail dot com
@ 2014-07-10  4:59 ` b.r.longbons at gmail dot com
  2023-09-13 14:47 ` tromey at sourceware dot org
  1 sibling, 0 replies; 3+ messages in thread
From: b.r.longbons at gmail dot com @ 2014-07-10  4:59 UTC (permalink / raw)
  To: gdb-prs

https://sourceware.org/bugzilla/show_bug.cgi?id=17138

Ben Longbons <b.r.longbons at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |b.r.longbons at gmail dot com

--- Comment #1 from Ben Longbons <b.r.longbons at gmail dot com> ---
Having done a lot of this, I find there are only a few solutions for dealing
with strings:

1. Avoid unicode strings entirely, use 'bytes', avoid operations that differ
(particularly, indexing, which returns an integer in python3). Disadvantage:
lots of operations are not available on 'bytes' in python3.
2. Use unicode everywhere, with errors='surrogateescape'. Disadvantages: have
to put up with lots of whining from the Python community about how you don't
understand strings; has no C implementation in python2 and you have to bundle
the python version.
3. Use unicode in python3, bytes in python2. Advantage: avoids most of the
language-feature problems. Disadvantage: *lots* of opportunities for subtle
bugs, such as the ones mentioned in this bug report.
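
To make options 1 and 2 concrete, here is a short Python 3 sketch (standard
library only; the sample bytes are taken from the original report):

```python
raw = b"foo\x80"

# Option 1: stay in bytes. Indexing gives an int in Python 3; slicing gives bytes.
indexed = raw[3]
sliced = raw[3:4]

# Option 2: decode with surrogateescape. The undecodable byte 0x80 becomes the
# lone surrogate U+DC80 instead of raising an exception.
text = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
restored = text.encode("utf-8", errors="surrogateescape")

# A strict encode of such a string fails, which is where the subtle bugs start.
try:
    text.encode("utf-8")
    strict_encode_failed = False
except UnicodeEncodeError:
    strict_encode_failed = True

print(indexed, sliced, ascii(text), restored == raw, strict_encode_failed)
```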

You'll note that all the real difficulties occur only in Python3, since it
*insists* that you have absolute knowledge about and control over your users
(this has caused no end of pain for people writing webservers).

3 is what programs do if you don't pay any attention. 2 is feasible if you are
developing new code with python3 as your primary target (from __future__ import
unicode_literals). 1 is the most correct for the kind of work gdb is doing, but
can be painful without a. But that leads us to:


4. Invent an entire new string type that just DTRT in both python2 and python3.
This *should* be possible as long as everyone duck types. It probably *is* safe
to assume that any unicode string you get (mostly, from python string literals)
is safe to treat as utf-8 (most of them will be ascii anyway), but for the vast
majority of your code, you can just deal with byte strings in whatever encoding
the inferior wants.

In approach 4, all functions that take strings, just need to feed them through
the new string factory, so there's not a lot of pain on callers. There is,
however, a problem that you can't use *builtin* functions on strings,
particularly you can't write: '%s %s' % (a, b), you have to write bstring('%s
%s') % (a, b). The best we can do for this case is try to see if it's possible
to make that always throw an exception, so at least they will fail quickly.
Unless maybe we hook in an AST rewriter like py.test does ...
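
A short sketch of the formatting problem described above, in plain Python 3.
It does not implement the hypothetical bstring type; it only shows what the
builtin str template does with bytes arguments, and the bytes %-formatting
that Python 3.5 later added (PEP 461), which postdates this comment:

```python
a, b = b"foo", b"\x80bar"

# A str template accepts bytes arguments without complaint and silently
# inserts their repr(), rather than failing quickly.
mixed = "%s %s" % (a, b)

# Since Python 3.5, a bytes template can be %-formatted directly.
byte_fmt = b"%s %s" % (a, b)

print(mixed)
print(byte_fmt)
```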

If all this seems too complicated:

5. Just stick with python2 forever, and achieve ultimate success by simply
ignoring python3 and unicode.


* [Bug python/17138] C strings, gdb.Value.__str__, and Python 3
  2014-07-10  2:47 [Bug python/17138] New: C strings, gdb.Value.__str__, and Python 3 naesten at gmail dot com
  2014-07-10  4:59 ` [Bug python/17138] " b.r.longbons at gmail dot com
@ 2023-09-13 14:47 ` tromey at sourceware dot org
  1 sibling, 0 replies; 3+ messages in thread
From: tromey at sourceware dot org @ 2023-09-13 14:47 UTC (permalink / raw)
  To: gdb-prs

https://sourceware.org/bugzilla/show_bug.cgi?id=17138

Tom Tromey <tromey at sourceware dot org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |tromey at sourceware dot org
             Status|NEW                         |RESOLVED
         Resolution|---                         |OBSOLETE

--- Comment #2 from Tom Tromey <tromey at sourceware dot org> ---
I think this area has been cleaned up somewhat and now this bug
is obsolete.

Python 2 is no longer supported.

Value.string does try to make a Python string and will fail if
the encoding is wrong.  It works like Python decoders, though.

Value.lazy_string can be used from pretty-printers to defer
decisions to gdb's internal decoding code.  When printing
this can handle possibly-incorrect contents.
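
A minimal pretty-printer sketch along those lines. The struct name_buffer and
its data field are hypothetical, chosen only for illustration, and registering
the printer with gdb is omitted:

```python
class NameBufferPrinter:
    """Sketch for a hypothetical `struct name_buffer { char *data; }`."""

    def __init__(self, val):
        # `val` is expected to be a gdb.Value of the hypothetical struct type.
        self.val = val

    def to_string(self):
        # Value.lazy_string() returns a lazy string object; gdb fetches and
        # decodes the characters itself when it prints the result, so no
        # Python-side decoding (and no UnicodeDecodeError) happens here.
        return self.val["data"].lazy_string()

    def display_hint(self):
        # Tell gdb to present the result as a string.
        return "string"
```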

Memory can be read directly now and returned as a memoryview
object.  This avoids all encoding problems and lets code
deal with just bytes.
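
For instance, a helper along these lines could pull out the raw bytes of a C
string. This is a sketch only: read_c_string_bytes and max_len are made-up
names, and the fixed-size read is an assumption that can fail if the range
crosses into unreadable memory.

```python
def read_c_string_bytes(inferior, address, max_len=256):
    """Return the bytes at `address` up to (not including) the first NUL.

    `inferior` is expected to behave like gdb.Inferior, whose
    read_memory(address, length) returns a memoryview-like buffer.
    """
    data = bytes(inferior.read_memory(address, max_len))
    end = data.find(b"\x00")
    # If no NUL was found within max_len bytes, return everything that was read.
    return data if end < 0 else data[:end]
```

Inside gdb this might be called as
read_c_string_bytes(gdb.selected_inferior(), int(val)) for a pointer-typed
gdb.Value named val; the result is a plain bytes object that the caller can
decode however it sees fit.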

