From: "naesten at gmail dot com"
To: gdb-prs@sourceware.org
Subject: [Bug python/17138] New: C strings, gdb.Value.__str__, and Python 3
Date: Thu, 10 Jul 2014 02:47:00 -0000

https://sourceware.org/bugzilla/show_bug.cgi?id=17138

            Bug ID: 17138
           Summary: C strings, gdb.Value.__str__, and Python 3
           Product: gdb
           Version: 7.7
            Status: NEW
          Severity: normal
          Priority: P2
         Component: python
          Assignee: unassigned at sourceware dot org
          Reporter: naesten at gmail dot com

I wanted to see how GDB's Python support dealt with strange C strings in Python 3 (using a build of 7.7.1 based on the Debian packaging git).

Now, I should probably start by reminding everyone that Python 3 has changed the rules for strings: where in Python 2 the "" syntax and the corresponding function-like-class, str(), implicitly refer to byte strings of no particular encoding, in Python 3 they refer to Unicode strings, though code units can nevertheless be 1, 2, or 4 bytes long. Python 3 (and 2.6+) have a new b"" syntax and bytes() type for strings of bytes (which while they might sometimes resemble text, should never be confused with actual text, unless of course they actually do represent text, in which case they should be decoded).

Note: Probably all of the str() calls in the following are technically redundant with the use of print(), but for clarity I will include them anyway. The parentheses around the argument to print are mandatory in Python 3, as the print keyword has been replaced by a builtin function.

The first thing I tried had what looked like VERY strange results:

(gdb) python print(str(gdb.parse_and_eval('"foo\x80"')))
"foo\302\200"
(gdb)

... until I realized that the escape was presumably being handled by Python here, and so was treated as referring to U+0080, so GDB just encoded it as UTF-8 before trying to parse it, with the obvious results.

So next I tried:

(gdb) python print(str(gdb.parse_and_eval('"foo\\x80"')))
"foo\200"
(gdb)

... which looks like GDB just invented UCS-1.

I also tried calling functions like len() and bytes() on these char* values, only to find that they were not implemented.

Around this point, I decided to consult the documentation, which I discovered did not mention the __str__() method *anywhere*, but did talk of a string() method, so I tried that out instead:

(gdb) python print(gdb.parse_and_eval('"foo\\x80"').string())
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 3: invalid start byte
Error while executing Python code.
(gdb)

At last, results that I can actually understand!
(Unhelpful though they may be.)

Are we sure this is the right default here? Might it not make more sense to return bytes unless specifically asked for an encoding? At the very least, we should definitely provide a way to get uninterpreted bytes in a bytes() object for Python 2.6+.
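
The three observations above can be reproduced in plain Python 3 without GDB. The following is only a sketch: gdb_style() is a hypothetical helper that imitates GDB's octal escapes and is not part of the gdb module, and reading "UCS-1" as Latin-1 decoding (one code point per byte) is an assumption about what was meant.

```python
def gdb_style(data: bytes) -> str:
    """Hypothetical helper: show non-printable bytes as octal escapes."""
    return "".join(chr(b) if 32 <= b < 127 else "\\%03o" % b for b in data)


# Single backslash: Python interprets \x80 itself, so the str holds the
# code point U+0080, and encoding it as UTF-8 gives the two bytes C2 80.
utf8_bytes = "foo\x80".encode("utf-8")
print(gdb_style(utf8_bytes))

# Doubled backslash: the \x80 escape reaches the C expression parser,
# which produces the single byte 0x80.
raw = b"foo\x80"
print(gdb_style(raw))

# Strict UTF-8 decoding rejects the lone 0x80 byte.
try:
    raw.decode("utf-8")
    message = None
except UnicodeDecodeError as exc:
    message = str(exc)
print(message)

# Alternatives that never raise on arbitrary bytes.
latin1 = raw.decode("latin-1")                            # one code point per byte
replaced = raw.decode("utf-8", errors="replace")          # U+FFFD for bad bytes
escaped = raw.decode("utf-8", errors="surrogateescape")   # reversible

# surrogateescape round-trips back to the original bytes.
round_trip = escaped.encode("utf-8", errors="surrogateescape")
```

Of the three alternatives, only latin-1 and surrogateescape preserve the original bytes; errors="replace" loses them. The GDB manual documents optional encoding and errors arguments for Value.string(), so passing an explicit encoding there may avoid the exception, though that has not been checked against 7.7.1 here.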