* Re: new:mbrtoc32.3: convert from to c32
From: Alejandro Colomar (man-pages) @ 2021-07-03 17:40 UTC (permalink / raw)
To: Radisson97; +Cc: linux-man, GNU C Library
Hi Peter,
On 6/20/21 10:29 PM, Radisson97@gmx.de wrote:
> From eb1ee6439f85b6a349c84488fa63dc7b795e43a0 Mon Sep 17 00:00:00 2001
> From: Peter Radisson <--show-origin>
> Date: Sun, 20 Jun 2021 22:21:55 +0200
> Subject: [PATCH] convert between multibyte sequence and 32-bit wide character
>
> documentation including example
>
> Signed-off-by: Peter Radisson <--show-origin>
Thanks for the page. I'll have a look at it.
BTW, next time you document a glibc function from scratch, please CC
glibc <libc-alpha@sourceware.org> so that they can comment, and maybe
find some bugs that we may not be able to detect.
Also, providing a rendered version of the page is good for glibc people
--who may not have cloned the man-pages-- to easily review it :)
Rendered page:
[[
MBRTOC32(3) Linux Programmer's Manual MBRTOC32(3)
NAME
mbrtoc32, c32rtomb - convert between multibyte sequence
and 32‐bit wide character
SYNOPSIS
#include <uchar.h>
size_t t mbrtoc32 (char32_t * restrict c32 ,
mbstate_t * restrict p);
size_t c32rtomb (char * restrict s, char32_t c32 ,
mbstate_t * restrict p );
DESCRIPTION
The mbrtoc32() function inspects at most n bytes of the
UTF‐8 multibyte string starting at s. If a multibyte is
identified as valid the corresponding UCS‐32 32‐bit wide
character is stored in c32. If the multibyte charac‐
ter is the null wide character, it resets the shift state
*p to the initial state and returns 0. If p is NULL, a
static anonymous state known only to the function is used
instead.
The c32rtomb() function converts the 32‐bit wide charac‐
ter stored in c32 into a mutability sequence into the
memory s.
RETURN VALUES
The mbrtoc32() function returns 0 for the nul character.
-1 for invalid input, -2 for a truncated input, -3 for
multibyte 32‐bit wide character sequence that is written
to *c32. No bytes are processed from the input
Otherwise the number of bytes in the multibyte sequence
is returned.
The c32tombr() function returns -1 on error otherwise the
number of bytes used for the multibytes sequence.
EXAMPLE
The input sequence is written as byte sequence to allow a
proper display. Note that the input is UTF‐8 and UTF‐32 ,
it may not possible to convert every code.
#include <stdio.h>
#include <stdlib.h>
#include <locale.h>
#include <uchar.h>
#include <wchar.h>
void toc32( char *in, int in_len, char32_t **outbuf, int *len)
{
char *p_in , *end ;
char32_t *p_out,*out;
size_t rc;
out=malloc(in_len*sizeof(*out));
p_out = out;
p_in = in;
end = in + in_len;
while((rc = mbrtoc32(p_out, p_in, end ‐ p_in, NULL)))
{
if(rc == ‐1) // invalid input
break;
else if(rc == (size_t)‐2) // truncated input
break;
else if(rc == (size_t)‐3) // UTF‐32 high surrogate
p_out += 1;
else {
p_in += rc;
p_out += 1;
};
}
// out_sz = p_out ‐ out + 1;
*len=p_out ‐ out + 1;
*outbuf=out;
}
void fromc32(char32_t *in, int in_len, char **outbuf, int *len)
{
char *out,*p;
int i;
size_t rc;
p=out=malloc(MB_CUR_MAX * in_len);
for(i=0;i<in_len;i++) {
rc=c32rtomb(p, in[i], NULL);
if(rc == (size_t)‐1) break;
p += rc;
}
*outbuf=out;
*len=p‐out+1;
}
void dump_u8(char *in, int len)
{
int i;
printf("Processing %d UTF‐8 code units: [ ", len);
for(i = 0; i <len ; ++i) printf("%#x ", (unsigned char)in[i]);
puts("]");
}
void dump_u32(char32_t *in, int len)
{
int i;
printf("Processing %d UTF‐32 code units: [ ", len);
for(i = 0; i < len; ++i) printf("0x%04x ", in[i]);
puts("]");
}
int main(void){
char in[] = "z00df6c34U0001F34C";
char32_t *out;
int out_len,len;
char *p;
// make sure we have utf8
setlocale(LC_ALL, "de_DE.utf8");
dump_u8(in,sizeof in / sizeof *in);
toc32(in,sizeof in / sizeof *in,&out,&out_len);
dump_u32(out,out_len);
fromc32(out,out_len,&p,&len);
dump_u8(p,len);
return 0;
}
This is a simple example and not production ready.
CONFORMING TO
C11
SEE ALSO
mbrtoc16(), c16tocmbr(), mbsrtowcs()
Linux 2021‐06‐02 MBRTOC32(3)
]]
> ---
> man3/mbrtoc32.3 | 154 ++++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 154 insertions(+)
> create mode 100644 man3/mbrtoc32.3
>
> diff --git a/man3/mbrtoc32.3 b/man3/mbrtoc32.3
> new file mode 100644
> index 000000000..8d0c33de1
> --- /dev/null
> +++ b/man3/mbrtoc32.3
> @@ -0,0 +1,154 @@
> +.TH MBRTOC32 3 "2021-06-02" Linux "Linux Programmer's Manual"
> +.SH NAME
> +mbrtoc32, c32rtomb \- convert between multibyte sequence and 32-bit wide character
> +.SH SYNOPSIS
> +.nf
> +.B #include <uchar.h>
> +.PP
> +.BI "size_t t mbrtoc32 (char32_t * restrict "c32 " ,"
> +.BI" const char *" restrict s " , size_t " n " ,"
> +.BI " mbstate_t * restrict " p ");"
> +.PP
> +.BI "size_t c32rtomb (char * restrict " s ", char32_t " c32 " ,"
> +.BI " mbstate_t * restrict " p " );"
> +.fi
> +.SH DESCRIPTION
> +The
> +.BR mbrtoc32 ()
> +function inspects at most
> +.I n
> +bytes of the UTF-8 multibyte string starting at
> +.IR s .
> +If a multibyte is identified as valid the corresponding UCS-32
> +32-bit wide character is stored in
> +.IR c32 .
> +If the multibyte character is the null wide character, it
> +resets the shift state
> +.I *p
> +to the initial state and returns 0.
> +If
> +.I p
> +is NULL, a static anonymous state known only to the
> +function is used instead.
> +.PP
> +The
> +.BR c32rtomb ()
> +function converts the 32-bit wide character stored in
> +.I c32
> +into a mutability sequence into the memory
> +.IR s .
> +.SH "RETURN VALUES"
> +The
> +.BR mbrtoc32 ()
> +function returns
> +0 for the nul character.
> +\-1 for invalid input,
> +\-2 for a truncated input,
> +\-3 for multibyte 32-bit wide character sequence that is
> +written to
> +.IR *c32 .
> +No bytes are processed from the input
> +.PP
> +Otherwise the number of bytes in the multibyte sequence is returned.
> +.PP
> +The
> +.BR c32tombr ()
> +function returns \-1 on error otherwise the number of bytes used
> +for the multibytes sequence.
> +.SH EXAMPLE
> +The input sequence is written as byte sequence to allow a proper
> +display. Note that the input is UTF-8 and UTF-32 , it may not possible
> +to convert every code.
> +.EX
> +.nf.
> +
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <locale.h>
> +#include <uchar.h>
> +#include <wchar.h>
> +
> +void toc32( char *in, int in_len, char32_t **outbuf, int *len)
> +{
> + char *p_in , *end ;
> + char32_t *p_out,*out;
> + size_t rc;
> +
> + out=malloc(in_len*sizeof(*out));
> + p_out = out;
> + p_in = in;
> + end = in + in_len;
> + while((rc = mbrtoc32(p_out, p_in, end - p_in, NULL)))
> + {
> + if(rc == -1) // invalid input
> + break;
> + else if(rc == (size_t)-2) // truncated input
> + break;
> + else if(rc == (size_t)-3) // UTF-32 high surrogate
> + p_out += 1;
> + else {
> + p_in += rc;
> + p_out += 1;
> + };
> + }
> + // out_sz = p_out - out + 1;
> + *len=p_out - out + 1;
> + *outbuf=out;
> +}
> +
> +void fromc32(char32_t *in, int in_len, char **outbuf, int *len)
> +{
> + char *out,*p;
> + int i;
> + size_t rc;
> + p=out=malloc(MB_CUR_MAX * in_len);
> + for(i=0;i<in_len;i++) {
> + rc=c32rtomb(p, in[i], NULL);
> + if(rc == (size_t)-1) break;
> + p += rc;
> + }
> + *outbuf=out;
> + *len=p-out+1;
> +}
> +
> +void dump_u8(char *in, int len)
> +{
> + int i;
> + printf("Processing %d UTF-8 code units: [ ", len);
> + for(i = 0; i <len ; ++i) printf("%#x ", (unsigned char)in[i]);
> + puts("]");
> +}
> +
> +void dump_u32(char32_t *in, int len)
> +{
> + int i;
> + printf("Processing %d UTF-32 code units: [ ", len);
> + for(i = 0; i < len; ++i) printf("0x%04x ", in[i]);
> + puts("]");
> +
> +}
> +
> +int main(void){
> + char in[] = "z\u00df\u6c34\U0001F34C";
> + char32_t *out;
> + int out_len,len;
> + char *p;
> + // make sure we have utf8
> + setlocale(LC_ALL, "de_DE.utf8");
> + dump_u8(in,sizeof in / sizeof *in);
> + toc32(in,sizeof in / sizeof *in,&out,&out_len);
> + dump_u32(out,out_len);
> + fromc32(out,out_len,&p,&len);
> + dump_u8(p,len);
> + return 0;
> +}
> +
> +.fi
> +.EE
> +This is a simple example and not production ready.
> +.SH "CONFORMING TO"
> +C11
> +.SH "SEE ALSO"
> +.BR mbrtoc16 (),
> +.BR c16tocmbr (),
> +.BR mbsrtowcs ()
> --
> 2.26.2
>
--
Alejandro Colomar
Linux man-pages comaintainer; https://www.kernel.org/doc/man-pages/
http://www.alejandro-colomar.es/
* Re: new:mbrtoc32.3: convert from to c32
From: Alejandro Colomar (man-pages) @ 2021-07-03 18:01 UTC (permalink / raw)
To: Radisson97, Michael Kerrisk (man-pages); +Cc: linux-man, GNU C Library
Hi Peter,
Please see some comments below.
Thanks,
Alex
On 7/3/21 7:40 PM, Alejandro Colomar (man-pages) wrote:
> Hi Peter,
>
> On 6/20/21 10:29 PM, Radisson97@gmx.de wrote:
>> From eb1ee6439f85b6a349c84488fa63dc7b795e43a0 Mon Sep 17 00:00:00 2001
>> From: Peter Radisson <--show-origin>
>> Date: Sun, 20 Jun 2021 22:21:55 +0200
>> Subject: [PATCH] convert between multibyte sequence and 32-bit wide character
>>
>> documentation including example
>>
>> Signed-off-by: Peter Radisson <--show-origin>
>
> Thanks for the page. I'll have a look at it.
>
> BTW, next time you document a glibc function from scratch, please CC
> glibc <libc-alpha@sourceware.org> so that they can comment, and maybe
> find some bugs that we may not be able to detect.
>
> Also, providing a rendered version of the page is good for glibc people
> --who may not have cloned the man-pages-- to easily review it :)
>
> Rendered page:
>
> [[
> MBRTOC32(3) Linux Programmer's Manual MBRTOC32(3)
>
> NAME
> mbrtoc32, c32rtomb - convert between multibyte sequence
> and 32‐bit wide character
>
> SYNOPSIS
> #include <uchar.h>
>
> size_t t mbrtoc32 (char32_t * restrict c32 ,
> mbstate_t * restrict p);
That prototype seems wrong. See:
.../glibc$ grep_glibc_prototype mbrtoc32;
wcsmbs/uchar.h:57:
extern size_t mbrtoc32 (char32_t *__restrict __pc32,
const char *__restrict __s, size_t __n,
mbstate_t *__restrict __p) __THROW;
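For reference, a call matching that four-argument prototype can be sketched as below. This is only an illustration: decode_first() is a hypothetical helper name, not part of the patch or of glibc, and what counts as a valid sequence depends on the current locale's encoding.

```c
#include <string.h>
#include <uchar.h>
#include <wchar.h>

/*
 * Hypothetical helper: decode the first multibyte character of s,
 * looking at no more than n bytes, and store it in *out.
 * A fresh conversion state is used for every call.
 * The return value of mbrtoc32() is passed through unchanged.
 */
size_t decode_first(const char *s, size_t n, char32_t *out)
{
    mbstate_t state;

    memset(&state, 0, sizeof(state));    /* initial conversion state */
    return mbrtoc32(out, s, n, &state);
}
```

Passing an explicit mbstate_t, rather than NULL, avoids relying on the function's internal static state.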
>
> size_t c32rtomb (char * restrict s, char32_t c32 ,
> mbstate_t * restrict p );
>
> DESCRIPTION
Are there any important differences compared to the already-documented
and C99-compliant mbrtowc(3) and wcrtomb(3)? I mean, apart from the
types of the parameters.
I think I would refer to mbrtowc(3) (and wcrtomb(3)) and specify here
only differences specific to these functions, such as the type.
Maybe?:
[
These functions are equivalent to mbrtowc(3) and wcrtomb(3), except that
these functions act on char32_t instead of wchar_t.
]
Otherwise we're unnecessarily repeating all of the information. If we
use the same wording, we're just duplicating maintenance issues. If we
use a different wording, readers might wonder why, if they seem to be
the same.
Also, when we need to repeat the text of those pages because of slight
differences, please keep the wording as close as possible to the wording
in those pages, for the same reasons.
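One way to check the suggested equivalence in practice is to decode the same bytes with both functions and compare the results. The sketch below assumes a platform such as glibc, where wchar_t holds the same 32-bit code point values as char32_t; same_as_mbrtowc() is a hypothetical name used only for this illustration.

```c
#include <string.h>
#include <uchar.h>
#include <wchar.h>

/*
 * Hypothetical check: returns 1 if mbrtowc() and mbrtoc32() agree on
 * the first character of s (same return value and, on success, the
 * same code point), and 0 otherwise.
 * Assumes wchar_t and char32_t use the same code point values.
 */
int same_as_mbrtowc(const char *s, size_t n)
{
    mbstate_t st_wc, st_c32;
    wchar_t wc = 0;
    char32_t c32 = 0;
    size_t r_wc, r_c32;

    memset(&st_wc, 0, sizeof(st_wc));
    memset(&st_c32, 0, sizeof(st_c32));

    r_wc = mbrtowc(&wc, s, n, &st_wc);
    r_c32 = mbrtoc32(&c32, s, n, &st_c32);

    if (r_wc != r_c32)
        return 0;
    if (r_wc == (size_t) -1 || r_wc == (size_t) -2)
        return 1;    /* both failed the same way; nothing was stored */
    return (char32_t) wc == c32;
}
```

If the two functions ever disagree for some input, that difference is exactly what the new page would need to document.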
> The mbrtoc32() function inspects at most n bytes of the
> UTF‐8 multibyte string starting at s. If a multibyte is
> identified as valid the corresponding UCS‐32 32‐bit wide
> character is stored in c32. If the multibyte charac‐
> ter is the null wide character, it resets the shift state
> *p to the initial state and returns 0. If p is NULL, a
> static anonymous state known only to the function is used
> instead.
>
> The c32rtomb() function converts the 32‐bit wide charac‐
> ter stored in c32 into a mutability sequence into the
> memory s.
>
> RETURN VALUES
> The mbrtoc32() function returns 0 for the nul character.
> -1 for invalid input, -2 for a truncated input, -3 for
> multibyte 32‐bit wide character sequence that is written
> to *c32. No bytes are processed from the input
>
> Otherwise the number of bytes in the multibyte sequence
> is returned.
>
> The c32tombr() function returns -1 on error otherwise the
> number of bytes used for the multibytes sequence.
>
> EXAMPLE
> The input sequence is written as byte sequence to allow a
> proper display. Note that the input is UTF‐8 and UTF‐32 ,
> it may not possible to convert every code.
>
> #include <stdio.h>
> #include <stdlib.h>
> #include <locale.h>
> #include <uchar.h>
> #include <wchar.h>
>
> void toc32( char *in, int in_len, char32_t **outbuf, int *len)
Please, follow the style of other existing pages (it's similar to the
kernel coding style with some exceptions).
Especially, regarding spaces around parentheses and commas (and other
operators).
Also, please use a consistent indentation of 4 spaces.
https://www.kernel.org/doc/html/v4.10/process/coding-style.html#spaces
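As an illustration of that style (4-space indentation, a space after keywords and commas, spaces around binary operators), the decoding loop might look roughly like the sketch below. It is not the patch author's code: mbs_to_c32() is a hypothetical name, the error handling is a choice made for this example, and a return value of (size_t) -3 is treated as an error on the assumption that it is not expected in UTF-8 locales.

```c
#include <stdlib.h>
#include <string.h>
#include <uchar.h>
#include <wchar.h>

/*
 * Hypothetical helper: decode up to in_len bytes of a multibyte string
 * into a newly allocated char32_t array, stopping at a null character.
 * Returns the number of characters stored before the terminator, or
 * (size_t) -1 on allocation failure or undecodable input.
 * The caller frees *outbuf.
 */
size_t mbs_to_c32(const char *in, size_t in_len, char32_t **outbuf)
{
    const char *p_in = in;
    const char *end = in + in_len;
    char32_t *out, *p_out;
    mbstate_t state;
    size_t rc;

    memset(&state, 0, sizeof(state));
    out = malloc((in_len + 1) * sizeof(*out));
    if (out == NULL)
        return (size_t) -1;

    p_out = out;
    while (p_in < end) {
        rc = mbrtoc32(p_out, p_in, (size_t) (end - p_in), &state);
        if (rc >= (size_t) -3) {    /* -1, -2 or -3: give up in this sketch */
            free(out);
            return (size_t) -1;
        }
        if (rc == 0)                /* null character stored in *p_out */
            break;
        p_in += rc;
        p_out++;
    }

    *outbuf = out;
    return (size_t) (p_out - out);
}
```

Each decoded character consumes at least one input byte, so in_len + 1 output elements are always enough here.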
--
Alejandro Colomar
Linux man-pages comaintainer; https://www.kernel.org/doc/man-pages/
http://www.alejandro-colomar.es/
* Re: new:mbrtoc32.3: convert from to c32
From: Radisson @ 2021-07-05 20:31 UTC (permalink / raw)
To: Alejandro Colomar (man-pages), Michael Kerrisk (man-pages)
Cc: linux-man, GNU C Library
Am 03.07.21 um 20:01 schrieb Alejandro Colomar (man-pages):
> Hi Peter,
>
> Please see some comments below.
>
> Thanks,
>
> Alex
>
> On 7/3/21 7:40 PM, Alejandro Colomar (man-pages) wrote:
>> Hi Peter,
>>
>> On 6/20/21 10:29 PM, Radisson97@gmx.de wrote:
>>> From eb1ee6439f85b6a349c84488fa63dc7b795e43a0 Mon Sep 17 00:00:00 2001
>>> From: Peter Radisson <--show-origin>
>>> Date: Sun, 20 Jun 2021 22:21:55 +0200
>>> Subject: [PATCH] convert between multibyte sequence and 32-bit wide character
>>>
>>> documentation including example
>>>
>>> Signed-off-by: Peter Radisson <--show-origin>
>> Thanks for the page. I'll have a look at it.
>>
>> BTW, next time you document a glibc function from scratch, please CC
>> glibc <libc-alpha@sourceware.org> so that they can comment, and maybe
>> find some bugs that we may not be able to detect.
>>
>> Also, providing a rendered version of the page is good for glibc people
>> --who may not have cloned the man-pages-- to easily review it :)
>>
>> Rendered page:
>>
>> [[
>> MBRTOC32(3) Linux Programmer's Manual MBRTOC32(3)
>>
>> NAME
>> mbrtoc32, c32rtomb - convert between multibyte sequence
>> and 32‐bit wide character
>>
>> SYNOPSIS
>> #include <uchar.h>
>>
>> size_t t mbrtoc32 (char32_t * restrict c32 ,
>> mbstate_t * restrict p);
> That prototype seems wrong. See:
>
> .../glibc$ grep_glibc_prototype mbrtoc32;
> wcsmbs/uchar.h:57:
> extern size_t mbrtoc32 (char32_t *__restrict __pc32,
> const char *__restrict __s, size_t __n,
> mbstate_t *__restrict __p) __THROW;
Easy fix: a typo in the page source eats that line.
>> size_t c32rtomb (char * restrict s, char32_t c32 ,
>> mbstate_t * restrict p );
>>
>> DESCRIPTION
> Are there any important differences compared to the already-documented
> and C99-compliant mbrtowc(3) and wcrtomb(3)? I mean, apart from the
> types of the parameters.
>
> I think I would refer to mbrtowc(3) (and wcrtomb(3)) and specify here
> only differences specific to these functions, such as the type.
>
> Maybe?:
>
> [
> These functions are equivalent to mbrtowc(3) and wcrtomb(3), except that
> these functions act on char32_t instead of wchar_t.
> ]
>
> Otherwise we're unnecessarily repeating all of the information. If we
> use the same wording, we're just duplicating maintenance issues. If we
> use a different wording, readers might wonder why, if they seem to be
> the same.
>
> Also, when we need to repeat the text of those pages because of slight
> differences, please use the text as similar as possible from those
> pages, for the same reasons.
>
I am no expert on the differences, so I will contact Bruno Haible;
I hope he can help (he has already written).
>> The mbrtoc32() function inspects at most n bytes of the
>> UTF‐8 multibyte string starting at s. If a multibyte is
>> identified as valid the corresponding UCS‐32 32‐bit wide
>> character is stored in c32. If the multibyte charac‐
>> ter is the null wide character, it resets the shift state
>> *p to the initial state and returns 0. If p is NULL, a
>> static anonymous state known only to the function is used
>> instead.
>>
>> The c32rtomb() function converts the 32‐bit wide charac‐
>> ter stored in c32 into a mutability sequence into the
>> memory s.
>>
>> RETURN VALUES
>> The mbrtoc32() function returns 0 for the nul character.
>> -1 for invalid input, -2 for a truncated input, -3 for
>> multibyte 32‐bit wide character sequence that is written
>> to *c32. No bytes are processed from the input
>>
>> Otherwise the number of bytes in the multibyte sequence
>> is returned.
>>
>> The c32tombr() function returns -1 on error otherwise the
>> number of bytes used for the multibytes sequence.
>>
>> EXAMPLE
>> The input sequence is written as byte sequence to allow a
>> proper display. Note that the input is UTF‐8 and UTF‐32 ,
>> it may not possible to convert every code.
>>
>> #include <stdio.h>
>> #include <stdlib.h>
>> #include <locale.h>
>> #include <uchar.h>
>> #include <wchar.h>
>>
>> void toc32( char *in, int in_len, char32_t **outbuf, int *len)
> Please, follow the style of other existing pages (it's similar to the
> kernel coding style with some exceptions).
> Especially, regarding spaces around parentheses and commas (and other
> operators).
> Also, please use a consistent indentation of 4 spaces.
>
> https://www.kernel.org/doc/html/v4.10/process/coding-style.html#spaces
Can I use: indent -i4 -linux ?
* Re: new:mbrtoc32.3: convert from to c32
From: Alejandro Colomar (man-pages) @ 2021-07-06 10:57 UTC (permalink / raw)
To: Radisson, Michael Kerrisk (man-pages); +Cc: linux-man, GNU C Library
Hi,
On 7/5/21 10:31 PM, Radisson wrote:
>>
> I am no expert on the differences i will make contact with Bruno Haible,
> i hope he can help (he wrote already).
Okay.
>>> void toc32( char *in, int in_len, char32_t **outbuf, int *len)
>> Please, follow the style of other existing pages (it's similar to the
>> kernel coding style with some exceptions).
>> Especially, regarding spaces around parentheses and commas (and other
>> operators).
>> Also, please use a consistent indentation of 4 spaces.
>>
>> https://www.kernel.org/doc/html/v4.10/process/coding-style.html#spaces
> can i use: indent -i4 -linux ?
I don't know. I guess that's an emacs thingy. I use vim.
I don't have vim configured for that, but I'll do so now (I use tabs for
everything else, except YAML). I don't even have vim configured for man,
because in my manual pages (for my code), I also use tabs.
But based on my vimrc, adapting it for 4-spaces in man, it could be
something like:
set nocindent
set nosmartindent
set noautoindent
set indentexpr=
filetype indent off
filetype plugin indent off
" YAML only works with spaces :(
au filetype yaml setlocal expandtab
au filetype yaml setlocal shiftwidth=8
au filetype yaml setlocal softtabstop=8
au filetype yaml setlocal tabstop=8
au filetype man setlocal expandtab
au filetype man setlocal shiftwidth=4
au filetype man setlocal softtabstop=4
au filetype man setlocal tabstop=4
I hope it helps.
Regards,
Alex
--
Alejandro Colomar
Linux man-pages comaintainer; https://www.kernel.org/doc/man-pages/
http://www.alejandro-colomar.es/
* Re: new:mbrtoc32.3: convert from to c32
2021-07-05 21:09 ` Radisson
@ 2021-07-06 11:06 ` Alejandro Colomar (man-pages)
0 siblings, 0 replies; 8+ messages in thread
From: Alejandro Colomar (man-pages) @ 2021-07-06 11:06 UTC (permalink / raw)
To: Radisson; +Cc: Bruno Haible, linux-man, libc-alpha, Michael Kerrisk (man-pages)
Hi,
On 7/5/21 11:09 PM, Radisson wrote:
>
>
> Am 05.07.21 um 21:07 schrieb Alejandro Colomar (man-pages):
>> Hello Bruno,
>>
>> On 7/4/21 12:26 PM, Bruno Haible wrote:
>>>> mbrtoc32, c32rtomb \- convert between multibyte sequence and 32-bit
>>>> wide character
>>>
>>> I would suggest two separate man pages for these functions.
>>> Rationale:
>>> It is rare that some code uses mbrtoc32 and c32rtomb in the same
>>> function.
>>> (Basically, functions that do input call mbrtoc32, and functions that do
>>> output call c32rtomb.) And the description of mbrtoc32 is a bit complex.
>>
>> Okay. Indeed, the *wc* functions are documented separately.
>
> I beg your pardon,
> we are not writing a program; for understanding the functions, I found
> it much more helpful to see the from-to connection.
No pardon needed :)
I don't have any strong feelings about how it should be organized in
files. There are 3 ways I see:
a) mbrtoc32 & mbrtowc together; c32rtomb & wcrtomb together
b) each one in a separate page
c) mbrtoc32 & c32rtomb together; mbrtowc separate from wcrtomb (as is now)
If you think any one is especially better than the rest, do it.
What I would like to especially make clear is the similarities and
differences between those 2 sets of functions. And not rewrite
everything from scratch, because that causes 2 main problems:
* Maintainability: maintaining different pages that say the same thing in
different ways is not a good thing, IMO.
* Confusion: Readers of the page may get the impression that the 2 sets
of functions are considerably different if they are documented differently.
Thanks,
Alex
>
>>
>>>
>>>> Are there any important differences compared to the already-documented
>>>> and C99-compliant mbrtowc(3) and wcrtomb(3)? I mean, apart from the
>>>> types of the parameters.
>>> No for c32rtomb, but yes for mbrtoc32: mbrtowc has the special return
>>> values (size_t)-1 and (size_t)-2, whereas mbrtoc32 also has the special
>>> return value (size_t)-3. Although, on glibc currently this special
>>> return value (size_t)-3 cannot occur. But IMO the man page should
>>> mention it nevertheless, otherwise people write code that is not
>>> future-proof.
>>
>> Thanks for those details!
>>
>> Regards,
>>
>> Alex
>>
>>
--
Alejandro Colomar
Linux man-pages comaintainer; https://www.kernel.org/doc/man-pages/
http://www.alejandro-colomar.es/
* Re: new:mbrtoc32.3: convert from to c32
2021-07-05 19:07 ` Alejandro Colomar (man-pages)
@ 2021-07-05 21:09 ` Radisson
2021-07-06 11:06 ` Alejandro Colomar (man-pages)
0 siblings, 1 reply; 8+ messages in thread
From: Radisson @ 2021-07-05 21:09 UTC (permalink / raw)
To: Alejandro Colomar (man-pages), Bruno Haible, libc-alpha, linux-man
Am 05.07.21 um 21:07 schrieb Alejandro Colomar (man-pages):
> Hello Bruno,
>
> On 7/4/21 12:26 PM, Bruno Haible wrote:
>>> mbrtoc32, c32rtomb \- convert between multibyte sequence and 32-bit
>>> wide character
>>
>> I would suggest two separate man pages for these functions.
>> Rationale:
>> It is rare that some code uses mbrtoc32 and c32rtomb in the same
>> function.
>> (Basically, functions that do input call mbrtoc32, and functions that do
>> output call c32rtomb.) And the description of mbrtoc32 is a bit complex.
>
> Okay. Indeed, the *wc* functions are documented separately.
I beg your pardon,
we are not writing a program; for understanding the functions, I found
it much more helpful to see the from-to connection.
>
>>
>>> Are there any important differences compared to the already-documented
>>> and C99-compliant mbrtowc(3) and wcrtomb(3)? I mean, apart from the
>>> types of the parameters.
>> No for c32rtomb, but yes for mbrtoc32: mbrtowc has the special return
>> values (size_t)-1 and (size_t)-2, whereas mbrtoc32 also has the special
>> return value (size_t)-3. Although, on glibc currently this special
>> return value (size_t)-3 cannot occur. But IMO the man page should
>> mention it nevertheless, otherwise people write code that is not
>> future-proof.
>
> Thanks for those details!
>
> Regards,
>
> Alex
>
>
* Re: new:mbrtoc32.3: convert from to c32
2021-07-04 10:26 Bruno Haible
@ 2021-07-05 19:07 ` Alejandro Colomar (man-pages)
2021-07-05 21:09 ` Radisson
0 siblings, 1 reply; 8+ messages in thread
From: Alejandro Colomar (man-pages) @ 2021-07-05 19:07 UTC (permalink / raw)
To: Bruno Haible, libc-alpha, linux-man, Peter Radisson
Hello Bruno,
On 7/4/21 12:26 PM, Bruno Haible wrote:
>> mbrtoc32, c32rtomb \- convert between multibyte sequence and 32-bit wide character
>
> I would suggest two separate man pages for these functions.
> Rationale:
> It is rare that some code uses mbrtoc32 and c32rtomb in the same function.
> (Basically, functions that do input call mbrtoc32, and functions that do
> output call c32rtomb.) And the description of mbrtoc32 is a bit complex.
Okay. Indeed, the *wc* functions are documented separately.
>
>> Are there any important differences compared to the already-documented
>> and C99-compliant mbrtowc(3) and wcrtomb(3)? I mean, apart from the
>> types of the parameters.
> No for c32rtomb, but yes for mbrtoc32: mbrtowc has the special return
> values (size_t)-1 and (size_t)-2, whereas mbrtoc32 also has the special
> return value (size_t)-3. Although, on glibc currently this special
> return value (size_t)-3 cannot occur. But IMO the man page should
> mention it nevertheless, otherwise people write code that is not
> future-proof.
Thanks for those details!
Regards,
Alex
--
Alejandro Colomar
Linux man-pages comaintainer; https://www.kernel.org/doc/man-pages/
http://www.alejandro-colomar.es/
* Re: new:mbrtoc32.3: convert from to c32
@ 2021-07-04 10:26 Bruno Haible
2021-07-05 19:07 ` Alejandro Colomar (man-pages)
0 siblings, 1 reply; 8+ messages in thread
From: Bruno Haible @ 2021-07-04 10:26 UTC (permalink / raw)
To: alx.manpages, libc-alpha, linux-man, Peter Radisson
> mbrtoc32, c32rtomb \- convert between multibyte sequence and 32-bit wide character
I would suggest two separate man pages for these functions.
Rationale:
It is rare that some code uses mbrtoc32 and c32rtomb in the same function.
(Basically, functions that do input call mbrtoc32, and functions that do
output call c32rtomb.) And the description of mbrtoc32 is a bit complex.
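To illustrate the output side of that split, here is a minimal sketch of
a routine that only ever calls c32rtomb(). The helper name encode_c32()
and its buffer-size convention are assumptions made up for this example;
they are not part of the proposed page. The multibyte encoding produced
depends on the current locale, so the sketch is only exercised with ASCII.

```c
#include <limits.h>
#include <stddef.h>
#include <string.h>
#include <uchar.h>

/*
 * Hypothetical helper (not from the man page): encode n char32_t values
 * from src into dst, which has room for cap bytes including the
 * terminating null byte.  Returns the number of bytes written (without
 * the terminator), or (size_t)-1 on error.
 */
size_t encode_c32(const char32_t *src, size_t n, char *dst, size_t cap)
{
    mbstate_t st;
    size_t used = 0;

    if (cap == 0)
        return (size_t)-1;

    memset(&st, 0, sizeof(st));         /* initial conversion state */

    for (size_t i = 0; i < n; i++) {
        char tmp[MB_LEN_MAX];           /* large enough for one character */
        size_t r = c32rtomb(tmp, src[i], &st);

        if (r == (size_t)-1)
            return (size_t)-1;          /* not representable in this locale */
        if (r >= cap - used)
            return (size_t)-1;          /* no room left (keep 1 byte for '\0') */

        memcpy(dst + used, tmp, r);
        used += r;
    }

    dst[used] = '\0';
    return used;
}
```

Converting into a temporary MB_LEN_MAX buffer first keeps the bounds
check simple, at the cost of one extra copy per character.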
> Are there any important differences compared to the already-documented
> and C99-compliant mbrtowc(3) and wcrtomb(3)? I mean, apart from the
> types of the parameters.
No for c32rtomb, but yes for mbrtoc32: mbrtowc has the special return
values (size_t)-1 and (size_t)-2, whereas mbrtoc32 also has the special
return value (size_t)-3. Although, on glibc currently this special
return value (size_t)-3 cannot occur. But IMO the man page should
mention it nevertheless, otherwise people write code that is not
future-proof.
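As a sketch of what future-proof calling code could look like, the loop
below handles all three special return values. The helper name
decode_c32() and its interface are hypothetical, chosen only for this
illustration. Per C11, (size_t)-3 means that a character left over from a
previous call was stored and no input bytes were consumed.

```c
#include <stddef.h>
#include <string.h>
#include <uchar.h>

/*
 * Hypothetical helper (not from the man page): decode the multibyte
 * string s into out, which has room for out_cap elements.  Returns the
 * number of char32_t values stored, or (size_t)-1 on a decoding error.
 */
size_t decode_c32(const char *s, char32_t *out, size_t out_cap)
{
    mbstate_t st;
    size_t n = strlen(s) + 1;           /* include the terminating null byte */
    size_t count = 0;

    memset(&st, 0, sizeof(st));         /* initial conversion state */

    while (count < out_cap) {
        char32_t c;
        size_t r = mbrtoc32(&c, s, n, &st);

        if (r == (size_t)-1 || r == (size_t)-2)
            return (size_t)-1;          /* invalid or incomplete sequence */
        if (r == (size_t)-3) {
            out[count++] = c;           /* stored from earlier input; */
            continue;                   /* no bytes consumed this time */
        }
        if (r == 0)
            break;                      /* null character: end of string */

        out[count++] = c;
        s += r;
        n -= r;
    }
    return count;
}
```

On current glibc the (size_t)-3 branch is never taken, so it costs
nothing, but it keeps the loop correct if that ever changes.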
Bruno
end of thread, other threads:[~2021-07-06 11:06 UTC | newest]
Thread overview: 8+ messages
[not found] <60cfa510.GHWZSa6DNoE9MWRF%Radisson97@gmx.de>
2021-07-03 17:40 ` new:mbrtoc32.3: convert from to c32 Alejandro Colomar (man-pages)
2021-07-03 18:01 ` Alejandro Colomar (man-pages)
2021-07-05 20:31 ` Radisson
2021-07-06 10:57 ` Alejandro Colomar (man-pages)
2021-07-04 10:26 Bruno Haible
2021-07-05 19:07 ` Alejandro Colomar (man-pages)
2021-07-05 21:09 ` Radisson
2021-07-06 11:06 ` Alejandro Colomar (man-pages)