From mboxrd@z Thu Jan 1 00:00:00 1970
From: Per Bothner
To: Ariel Rios
Cc: Owen Taylor, guile@sourceware.cygnus.com, guile-gtk@sourceware.cygnus.com
Subject: Re: Fwd: [[Gnome-bindings] Strings and bindings]
Date: Sun, 16 Apr 2000 12:45:00 -0000
Message-id:
References: <20000416043318.14641.qmail@nwcst293.netaddress.usa.net>
X-SW-Source: 2000-q2/msg00020.html

(Feel free to forward this appropriately.)

Owen Taylor writes:

> The Unicode standard is currently only using 16-bit characters,
> all common characters for living languages are planned to be
> included in the 16-bit space, and many systems do use 16-bit
> characters. (Windows, Java, Python)
>
> However, there will soon be some character sets defined outside
> of the 16-bit "Basic Multilingual Plane", and allowing
> 32-bit characters is, IMO, nicer than confining oneself to
> an almost-full character space.

Using 16 bits should not be a problem.  Unicode has support for
"surrogates", an extension mechanism that encodes an additional
20-bit code space using pairs of 16-bit Unicode characters.  (A
sketch of the arithmetic is at the end of this message.)  That
20-bit space is *far* from full - as far as I know, it is still
officially empty (though proposals have been made for rare scripts
and symbols).

> - Create an STL-string-like wrapper for a utf8 string. The
>   problem here is that you don't get O(1) random access, which
>   will no doubt disturb some of the people reading this.

But there is almost nothing useful you can do with strings that
requires O(1) random access by character index, at least once you
are dealing with non-trivial character sets.  What you sometimes
need is efficient access to a position in the string, but that
position can be a "magic cookie" represented as a byte offset (see
the second sketch at the end of this message).

So using UTF-8 is perfectly reasonable.  Using 16-bit Unicode with
surrogates is also perfectly reasonable.  Using arrays of 32-bit
wide characters does not make sense to me (though I know that glibc
maintainer Ulrich Drepper feels strongly otherwise).
-- 
	--Per Bothner
per@bothner.com     http://www.bothner.com/~per/
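
P.S.  For concreteness, here is a rough C sketch of the surrogate-pair
arithmetic mentioned above.  The constants (0xD800, 0xDC00, and the
0x10000 offset) are the ones defined by the Unicode standard; the
function names and the example code point are made up for illustration,
and error checking is deliberately minimal.

#include <stdio.h>
#include <stdint.h>

/* Split a code point above U+FFFF into a UTF-16 surrogate pair.
   Assumes 0x10000 <= cp <= 0x10FFFF. */
static void
to_surrogates (uint32_t cp, uint16_t *hi, uint16_t *lo)
{
  cp -= 0x10000;                 /* leaves a 20-bit value */
  *hi = 0xD800 | (cp >> 10);     /* high (leading) surrogate: top 10 bits */
  *lo = 0xDC00 | (cp & 0x3FF);   /* low (trailing) surrogate: bottom 10 bits */
}

/* Recombine a surrogate pair into the original code point. */
static uint32_t
from_surrogates (uint16_t hi, uint16_t lo)
{
  return 0x10000 + (((uint32_t) (hi - 0xD800) << 10)
                    | (uint32_t) (lo - 0xDC00));
}

int
main (void)
{
  uint16_t hi, lo;
  uint32_t cp = 0x10300;         /* an arbitrary code point beyond the 16-bit plane */

  to_surrogates (cp, &hi, &lo);
  printf ("U+%05X -> %04X %04X -> U+%05X\n",
          (unsigned) cp, (unsigned) hi, (unsigned) lo,
          (unsigned) from_surrogates (hi, lo));
  return 0;
}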
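
And here is a similar sketch of walking a UTF-8 string using nothing
but a byte offset as the position "cookie".  Again the names are
illustrative only - this is not any particular library's API - and
validation of the input is kept to the bare minimum.

#include <stdio.h>
#include <stdint.h>
#include <string.h>

/* Decode the character starting at byte offset *pos and advance *pos
   to the start of the next character.  The byte offset itself is the
   position "cookie"; no character indexing is needed.  Returns
   (uint32_t) -1 on a malformed or truncated sequence. */
static uint32_t
utf8_next (const unsigned char *s, size_t len, size_t *pos)
{
  unsigned char c = s[(*pos)++];
  uint32_t cp;
  int extra;

  if (c < 0x80)      { cp = c;        extra = 0; }   /* plain ASCII byte */
  else if (c < 0xC0) { return (uint32_t) -1; }       /* stray continuation byte */
  else if (c < 0xE0) { cp = c & 0x1F; extra = 1; }   /* 2-byte sequence */
  else if (c < 0xF0) { cp = c & 0x0F; extra = 2; }   /* 3-byte sequence */
  else               { cp = c & 0x07; extra = 3; }   /* 4-byte sequence */

  while (extra-- > 0)
    {
      if (*pos >= len || (s[*pos] & 0xC0) != 0x80)
        return (uint32_t) -1;                        /* truncated or malformed */
      cp = (cp << 6) | (s[*pos] & 0x3F);
      (*pos)++;
    }
  return cp;
}

int
main (void)
{
  const char *text = "a\xC3\xA9\xE2\x82\xAC";   /* "a", e-acute, euro sign */
  size_t len = strlen (text);
  size_t pos = 0;                               /* the byte-offset cookie */

  while (pos < len)
    {
      size_t here = pos;
      uint32_t cp = utf8_next ((const unsigned char *) text, len, &pos);
      printf ("byte offset %lu: U+%04X\n", (unsigned long) here, (unsigned) cp);
    }
  return 0;
}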