* Re: [PATCH V4][BZ #18441] fix sorting multibyte charsets with an improper locale
@ 2016-05-23 7:37 Leonhard Holz
2016-06-17 5:39 ` Leonhard Holz
` (3 more replies)
0 siblings, 4 replies; 6+ messages in thread
From: Leonhard Holz @ 2016-05-23 7:37 UTC (permalink / raw)
To: GNU C Library; +Cc: Carlos O'Donell
Hi Carlos,
>> The approach of the patch is to interpret TABLEMB as a hash table and find a
>> better hash key. My first try was to somehow "fold" a multibyte character into one
>> byte, but that worsened the overall performance a lot. Enlarging the table to
>> 2-byte keys works much better while needing a reasonable amount of extra memory.
>
> How much memory on average?
>
> Worst case is all possible 2-byte hash values?
>
> e.g. 256*256*(sizeof(struct element_t)) == ~10MB?
>
> Best case is the cost of the NULL pointer, so 256*256*8 = 0.5MB?
TABLEMB is an int32_t array, so it's 256*256*4 bytes = 256 KiB.
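For illustration, here is a minimal sketch of the two-byte hash key idea (hypothetical names; the real TABLEMB lives in the compiled locale data, this is not the glibc code itself):

```c
#include <stdint.h>
#include <stdlib.h>

/* Illustrative sketch only: TABLEMB reinterpreted as a hash table whose
   key is built from the first two bytes of a multibyte sequence.  Each
   slot is an int32_t offset into the shared sequence data (0 = empty),
   so the table costs 256 * 256 * sizeof (int32_t) bytes = 256 KiB.  */
#define MB_TABLE_SLOTS (256 * 256)

/* Build the hash key from the first (and, if present, second) byte.
   Single-byte strings are padded with 0 so they still map to a slot.  */
static inline uint32_t
mb_hash_key (const unsigned char *s, size_t len)
{
  uint32_t b0 = s[0];
  uint32_t b1 = len > 1 ? s[1] : 0;
  return (b0 << 8) | b1;
}
```

Because the key now covers two bytes, a lookup in a single-byte locale must still compare the first byte of the stored sequence, which is the small slowdown mentioned below.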
>> The patch vastly improves the performance of languages with multibyte characters (see
>> zh_CN, hi_IN and ja_JP below). A side effect is that some languages with single-byte
>> characters get a bit slower because of the extra check for the first byte while finding
>> the right sequence in the sequence list. It cannot be avoided since the hash key is no
>> longer equal to the first byte of the sequence. Tests are ok.
>
> Can we use UTF-8-specific knowledge to accelerate the lookup?
>
> For example, you know that E0 is always the start of a 3-byte UTF-8 sequence.
>
> Could you do two checks?
>
> (a) Are we in a 1-, 2-, 3-, 4-, 5-, or 6-byte sequence?
> (b) If in a 1-byte sequence, use a one-byte table.
> (c) If in a 2-6 byte sequence, use the hash table?
I tried it, but the switches necessary to make that work in locale/weight.h findidx() slow down the lookup (e.g. by 5% for an
en_US.UTF-8 file listing). Because of the padding of the collation file it does not save space either, so I would not
take that path.
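For reference, the leading-byte classification Carlos refers to (0xE0 always starting a 3-byte sequence) looks roughly like this; a sketch of the general technique, not the actual findidx() code:

```c
/* Sketch of UTF-8 leading-byte classification: the high bits of the
   first byte encode the total sequence length.  Returns 0 for bytes
   that cannot start a sequence (continuation bytes 0x80-0xBF; the
   invalid start bytes 0xFE/0xFF are not distinguished here).  */
static int
utf8_sequence_length (unsigned char c)
{
  if (c < 0x80) return 1;   /* 0xxxxxxx: ASCII */
  if (c < 0xC0) return 0;   /* 10xxxxxx: continuation byte */
  if (c < 0xE0) return 2;   /* 110xxxxx */
  if (c < 0xF0) return 3;   /* 1110xxxx: e.g. 0xE0 */
  if (c < 0xF8) return 4;   /* 11110xxx */
  if (c < 0xFC) return 5;   /* 111110xx (legacy, pre-RFC 3629) */
  return 6;                 /* 1111110x (legacy) */
}
```

The branch itself is cheap; the cost comes from adding it to every findidx() call, which is what produced the ~5% regression on single-byte input.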
>>> Will this always work? I'm just wondering about a user-generated charmap that they
>>> call 'utf8', which is the other common alias, used for instance where the dash is not
>>> valid syntax. Probably not, since the official name is UTF-8, and that's what you should use.
>>
>> Well, if it does not work it's just a speed penalty. But there is no problem in adding a check for "utf8".
>
> Could you check to see what value 'code_set_name' uses internally?
>
> Is it always 'UTF-8'? If it is, then the check you have is just fine.
>
> Otherwise we should check for utf8 and UTF-8.
>
> We don't know until you verify the values code_set_name can have.
The source of code_set_name is the charmap file (localedata/charmaps/UTF-8), and there it is defined as UTF-8.
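So a comparison against that single spelling suffices for the stock charmap; if we wanted to be defensive about user-built charmaps anyway, a hypothetical helper accepting both spellings would be:

```c
#include <string.h>

/* Hypothetical helper, not the patch's actual check: recognize the
   codeset as UTF-8.  code_set_name comes from the charmap file, which
   for the stock charmap (localedata/charmaps/UTF-8) always yields
   "UTF-8"; the "utf8" spelling could only appear in a user-generated
   charmap.  */
static int
is_utf8_codeset (const char *code_set_name)
{
  return strcmp (code_set_name, "UTF-8") == 0
         || strcmp (code_set_name, "utf8") == 0;
}
```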
>> diff --git a/locale/C-collate.c b/locale/C-collate.c
>> index 8214ff5..5a9ed6a 100644
>> --- a/locale/C-collate.c
>> +++ b/locale/C-collate.c
>> @@ -144,6 +144,8 @@ const struct __locale_data _nl_C_LC_COLLATE attribute_hidden =
>> /* _NL_COLLATE_COLLSEQWC */
>> { .string = (const char *) collseqwc },
>> /* _NL_COLLATE_CODESET */
>> - { .string = _nl_C_codeset }
>> + { .string = _nl_C_codeset },
>> + /* _NL_COLLATE_ENCODING_TYPE */
>> + { .word = __cet_8bit }
>
> This makes locale-archive incompatible again right?
>
> Users have to regenerate the locale-archive after the upgrade?
>
> We need a release note for that under "Packaging Changes"
> https://sourceware.org/glibc/wiki/Release/2.24#Packaging_Changes
>
> The release note should mention the binary locale-archive format
> has changed and that locale-archive must be removed before upgrading
> and then recompiled after upgrade.
Yes. Who should do it?
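For the packaging note, the regeneration step would amount to something like the following (a sketch; paths follow the usual glibc install layout, and distributions would script this in their upgrade hooks):

```shell
# Remove the stale binary archive, then rebuild each installed locale
# from its source definition and charmap.
rm -f /usr/lib/locale/locale-archive
localedef -i en_US -f UTF-8 en_US.UTF-8    # repeat per installed locale
```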
Best,
Leonhard
* Re: [PATCH V4][BZ #18441] fix sorting multibyte charsets with an improper locale
2016-05-23 7:37 [PATCH V4][BZ #18441] fix sorting multibyte charsets with an improper locale Leonhard Holz
@ 2016-06-17 5:39 ` Leonhard Holz
2016-06-27 17:22 ` [PING][PATCH " Leonhard Holz
` (2 subsequent siblings)
3 siblings, 0 replies; 6+ messages in thread
From: Leonhard Holz @ 2016-06-17 5:39 UTC (permalink / raw)
To: GNU C Library; +Cc: Carlos O'Donell
Any comments on this?
On 22.05.2016 at 17:52, Leonhard Holz wrote:
> [...]
* [PING][PATCH V4][BZ #18441] fix sorting multibyte charsets with an improper locale
2016-05-23 7:37 [PATCH V4][BZ #18441] fix sorting multibyte charsets with an improper locale Leonhard Holz
2016-06-17 5:39 ` Leonhard Holz
@ 2016-06-27 17:22 ` Leonhard Holz
2016-07-12 6:50 ` Leonhard Holz
2016-08-12 4:30 ` Leonhard Holz
3 siblings, 0 replies; 6+ messages in thread
From: Leonhard Holz @ 2016-06-27 17:22 UTC (permalink / raw)
To: GNU C Library; +Cc: Carlos O'Donell
On 22.05.2016 at 17:52, Leonhard Holz wrote:
> [...]
* [PING][PATCH V4][BZ #18441] fix sorting multibyte charsets with an improper locale
2016-05-23 7:37 [PATCH V4][BZ #18441] fix sorting multibyte charsets with an improper locale Leonhard Holz
2016-06-17 5:39 ` Leonhard Holz
2016-06-27 17:22 ` [PING][PATCH " Leonhard Holz
@ 2016-07-12 6:50 ` Leonhard Holz
2016-08-12 4:30 ` Leonhard Holz
3 siblings, 0 replies; 6+ messages in thread
From: Leonhard Holz @ 2016-07-12 6:50 UTC (permalink / raw)
To: GNU C Library; +Cc: Carlos O'Donell
On 22.05.2016 at 17:52, Leonhard Holz wrote:
> [...]
* [PING][PATCH V4][BZ #18441] fix sorting multibyte charsets with an improper locale
2016-05-23 7:37 [PATCH V4][BZ #18441] fix sorting multibyte charsets with an improper locale Leonhard Holz
` (2 preceding siblings ...)
2016-07-12 6:50 ` Leonhard Holz
@ 2016-08-12 4:30 ` Leonhard Holz
3 siblings, 0 replies; 6+ messages in thread
From: Leonhard Holz @ 2016-08-12 4:30 UTC (permalink / raw)
To: GNU C Library; +Cc: Carlos O'Donell
> [...]
* [PING][PATCH V4][BZ #18441] fix sorting multibyte charsets with an improper locale
@ 2016-03-20 21:13 Leonhard Holz
0 siblings, 0 replies; 6+ messages in thread
From: Leonhard Holz @ 2016-03-20 21:13 UTC (permalink / raw)
To: GNU C Library
Please review https://sourceware.org/ml/libc-alpha/2016-02/msg00874.html.