public inbox for systemtap@sourceware.org
* [Bug runtime/14487] New: need better UTF-8 handling
@ 2012-08-18  1:01 jistone at redhat dot com
  2017-10-17 16:51 ` [Bug runtime/14487] " carlos at redhat dot com
                   ` (2 more replies)
  0 siblings, 3 replies; 4+ messages in thread
From: jistone at redhat dot com @ 2012-08-18  1:01 UTC (permalink / raw)
  To: systemtap

http://sourceware.org/bugzilla/show_bug.cgi?id=14487

             Bug #: 14487
           Summary: need better UTF-8 handling
           Product: systemtap
           Version: unspecified
            Status: NEW
          Severity: enhancement
          Priority: P2
         Component: runtime
        AssignedTo: systemtap@sourceware.org
        ReportedBy: jistone@redhat.com
    Classification: Unclassified


We generally take the blissful approach that all strings are merely
0-terminated byte sequences, and we don't care much about the meaning of those
bytes.

This breaks down in any instance where we start splitting up those bytes
though.  The most obvious case is with any truncation at MAXSTRINGLEN.  This
could lead to an incomplete UTF-8 sequence at the tail.  (Fortunately UTF-8 is
robust enough that this only corrupts one Unicode character in the output.)  We
also have functions like substr() which count by bytes rather than characters.
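
To make the byte-versus-character gap concrete, here is a minimal sketch in C. It is not systemtap runtime code; the function names are hypothetical, and it assumes the input is well-formed UTF-8. It relies only on the fact that UTF-8 continuation bytes have the bit pattern 10xxxxxx.

```c
#include <stddef.h>

/* Nonzero if b is a UTF-8 continuation byte (bit pattern 10xxxxxx). */
static int utf8_is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Count code points in a 0-terminated string, assuming well-formed
 * UTF-8: every byte that is not a continuation byte starts one. */
static size_t utf8_count_codepoints(const char *s)
{
    size_t n = 0;

    for (; *s; s++)
        if (!utf8_is_continuation((unsigned char)*s))
            n++;
    return n;
}

/* Nonzero if keeping only the first `cut` bytes of s would end in the
 * middle of a multi-byte sequence. The caller must guarantee that
 * cut <= strlen(s). */
static int utf8_cut_splits_sequence(const char *s, size_t cut)
{
    return utf8_is_continuation((unsigned char)s[cut]);
}
```

For the 6-byte string "h\xC3\xA9llo" ("héllo"), the count is 5 code points, and a cut after 2 bytes lands between the two bytes of "é", which is the situation a byte-oriented truncation or substr() can produce.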

It's not clear that we can solve this 100%, but if we choose to commit to a
worldview that all strings are UTF-8, then we could make and use our own
runtime strlcpy8, strlcat8, etc. functions which preserve boundaries.
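
As a rough illustration of what a boundary-preserving copy might look like, here is a sketch. The name sketch_strlcpy8 and its strlcpy-like contract (always terminate, return strlen(src)) are assumptions made for this example, since no such function is defined in this thread. It also assumes the source is well-formed UTF-8.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical strlcpy-style copy that never leaves a partial UTF-8
 * sequence at the tail of dst. Returns strlen(src), like strlcpy, so a
 * caller can detect truncation by comparing against size. */
size_t sketch_strlcpy8(char *dst, const char *src, size_t size)
{
    size_t srclen = strlen(src);
    size_t n;

    if (size == 0)
        return srclen;

    /* Number of bytes a plain strlcpy would copy. */
    n = srclen < size ? srclen : size - 1;

    /* src[n] is the first byte left out. If it is a continuation byte
     * (10xxxxxx), the cut falls inside a sequence, so back up until the
     * first excluded byte is a lead byte or ASCII. When nothing is
     * truncated, src[n] is the terminator and the loop does not run. */
    while (n > 0 && ((unsigned char)src[n] & 0xC0) == 0x80)
        n--;

    memcpy(dst, src, n);
    dst[n] = '\0';
    return srclen;
}
```

With a 4-byte buffer, copying "ab\xC3\xA9" yields "ab" rather than "ab" followed by a dangling 0xC3. The cost is that the result can be up to three bytes shorter than the buffer allows.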

Even then, this is preserving only *code points*, whereas one may really have
composite characters with combining diacritical marks and such.  I believe
combining characters are in specific ranges (though new Unicode versions can
expand this), so really fancy runtime functions might preserve these
connections too.
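
A sketch of the "really fancy" variant, under loud assumptions: the function names are hypothetical, and the range table covers only the five Unicode blocks whose names begin with "Combining" (U+0300-U+036F, U+1AB0-U+1AFF, U+1DC0-U+1DFF, U+20D0-U+20FF, U+FE20-U+FE2F). Many scripts have combining marks outside these blocks, so this is illustrative, not complete. The decoder does not reject overlong encodings.

```c
#include <stddef.h>

/* Decode the code point starting at p. Returns U+FFFD for a malformed
 * sequence. Checking each continuation byte before using it means the
 * loop never reads past the string's 0 terminator. */
static unsigned long sketch_decode_at(const unsigned char *p)
{
    unsigned long cp;
    int extra, i;

    if (p[0] < 0x80)
        return p[0];
    else if ((p[0] & 0xE0) == 0xC0) { cp = p[0] & 0x1F; extra = 1; }
    else if ((p[0] & 0xF0) == 0xE0) { cp = p[0] & 0x0F; extra = 2; }
    else if ((p[0] & 0xF8) == 0xF0) { cp = p[0] & 0x07; extra = 3; }
    else
        return 0xFFFD;

    for (i = 1; i <= extra; i++) {
        if ((p[i] & 0xC0) != 0x80)
            return 0xFFFD;
        cp = (cp << 6) | (p[i] & 0x3F);
    }
    return cp;
}

/* Partial test: only the blocks named "Combining ..." are listed. */
static int sketch_is_combining(unsigned long cp)
{
    return (cp >= 0x0300 && cp <= 0x036F) ||
           (cp >= 0x1AB0 && cp <= 0x1AFF) ||
           (cp >= 0x1DC0 && cp <= 0x1DFF) ||
           (cp >= 0x20D0 && cp <= 0x20FF) ||
           (cp >= 0xFE20 && cp <= 0xFE2F);
}

/* Move a proposed cut (in bytes, cut <= strlen(s)) backwards until it
 * neither splits a UTF-8 sequence nor separates a combining mark from
 * the character before it. Returns the adjusted cut. */
size_t sketch_cut_keep_marks(const char *s, size_t cut)
{
    const unsigned char *u = (const unsigned char *)s;

    for (;;) {
        /* Back up to the start of the code point containing u[cut]. */
        while (cut > 0 && (u[cut] & 0xC0) == 0x80)
            cut--;
        if (cut == 0 || !sketch_is_combining(sketch_decode_at(u + cut)))
            return cut;
        /* The first excluded code point is a combining mark: step into
         * the previous code point and try again. */
        cut--;
    }
}
```

For "e\xCC\x81x" (e, U+0301 COMBINING ACUTE ACCENT, x), a cut at byte 1 or 2 moves back to 0, so the accent is never stranded from its base letter, while a cut at byte 3 is left alone.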

-- 
Configure bugmail: http://sourceware.org/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.


* [Bug runtime/14487] need better UTF-8 handling
  2012-08-18  1:01 [Bug runtime/14487] New: need better UTF-8 handling jistone at redhat dot com
@ 2017-10-17 16:51 ` carlos at redhat dot com
  2017-10-17 19:59 ` jistone at redhat dot com
  2017-10-17 20:01 ` carlos at redhat dot com
  2 siblings, 0 replies; 4+ messages in thread
From: carlos at redhat dot com @ 2017-10-17 16:51 UTC (permalink / raw)
  To: systemtap

https://sourceware.org/bugzilla/show_bug.cgi?id=14487

Carlos O'Donell <carlos at redhat dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
                 CC|                            |carlos at redhat dot com
         Resolution|---                         |WONTFIX

--- Comment #1 from Carlos O'Donell <carlos at redhat dot com> ---
There are no APIs in glibc that are rich enough to provide the functionality
you are interested in accessing.

The only API that has this kind of support is libicu. Even then, its API is
immensely expressive and maps almost 1:1 onto Unicode data, so you can use it
directly to ask questions about characters, combining characters, etc., but
you have to know what you're doing and understand the details. We would likely
never add something to glibc that is so specific to Unicode.

Therefore I'm going to close this bug as RESOLVED/WONTFIX, since we have no
plan to add such APIs.

For now we recommend you use libicu for the complex functions and glibc for the
basic routines.


* [Bug runtime/14487] need better UTF-8 handling
  2012-08-18  1:01 [Bug runtime/14487] New: need better UTF-8 handling jistone at redhat dot com
  2017-10-17 16:51 ` [Bug runtime/14487] " carlos at redhat dot com
@ 2017-10-17 19:59 ` jistone at redhat dot com
  2017-10-17 20:01 ` carlos at redhat dot com
  2 siblings, 0 replies; 4+ messages in thread
From: jistone at redhat dot com @ 2017-10-17 19:59 UTC (permalink / raw)
  To: systemtap

https://sourceware.org/bugzilla/show_bug.cgi?id=14487

--- Comment #2 from Josh Stone <jistone at redhat dot com> ---
FWIW, this was about the systemtap runtime, in-kernel, not a request for glibc.


* [Bug runtime/14487] need better UTF-8 handling
  2012-08-18  1:01 [Bug runtime/14487] New: need better UTF-8 handling jistone at redhat dot com
  2017-10-17 16:51 ` [Bug runtime/14487] " carlos at redhat dot com
  2017-10-17 19:59 ` jistone at redhat dot com
@ 2017-10-17 20:01 ` carlos at redhat dot com
  2 siblings, 0 replies; 4+ messages in thread
From: carlos at redhat dot com @ 2017-10-17 20:01 UTC (permalink / raw)
  To: systemtap

https://sourceware.org/bugzilla/show_bug.cgi?id=14487

Carlos O'Donell <carlos at redhat dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|RESOLVED                    |UNCONFIRMED
         Resolution|WONTFIX                     |---
     Ever confirmed|1                           |0

--- Comment #3 from Carlos O'Donell <carlos at redhat dot com> ---
(In reply to Josh Stone from comment #2)
> FWIW, this was about the systemtap runtime, in-kernel, not a request for
> glibc.

Oh jeez! Sorry. I ran some bugzilla queries I had and I guess this one didn't
filter by component :} Either way the recommendation remains the same! Use
libicu! :-)

