* Re: [PATCH V4][BZ #18441] fix sorting multibyte charsets with an improper locale
@ 2016-05-23 7:37 Leonhard Holz
2016-06-17 5:39 ` Leonhard Holz
` (3 more replies)
0 siblings, 4 replies; 6+ messages in thread
From: Leonhard Holz @ 2016-05-23 7:37 UTC (permalink / raw)
To: GNU C Library; +Cc: Carlos O'Donell
Hi Carlos,
>> The approach of the patch is to interpret TABLEMB as a hash table and find a
>> better hash key. My first try was to somehow "fold" a multibyte character into one
>> byte, but that worsened the overall performance a lot. Enlarging the table to
>> 2-byte keys works much better while needing a reasonable amount of extra memory.
>
> How much memory on average?
>
> Worst case is all possible 2-byte hash values?
>
> e.g. 256*256*(sizeof(struct element_t)) == ~10MB?
>
> Best case is the cost of the NULL pointer, so 256*256*8 = 0.5MB?
TABLEMB is an int32_t array, so it's 256*256*4 bytes = 256 KiB.
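For illustration, here is a minimal sketch of the two-byte hash key idea (hypothetical names; the real TABLEMB lives in the compiled locale data, this is not the glibc code itself):

```c
#include <stdint.h>
#include <stdlib.h>

/* Illustrative sketch only: TABLEMB reinterpreted as a hash table whose
   key is built from the first two bytes of a multibyte sequence.  Each
   slot is an int32_t offset into the shared sequence data (0 = empty),
   so the table costs 256 * 256 * sizeof (int32_t) bytes = 256 KiB.  */
#define MB_TABLE_SLOTS (256 * 256)

/* Build the hash key from the first (and, if present, second) byte.
   Single-byte strings are padded with 0 so they still map to a slot.  */
static inline uint32_t
mb_hash_key (const unsigned char *s, size_t len)
{
  uint32_t b0 = s[0];
  uint32_t b1 = len > 1 ? s[1] : 0;
  return (b0 << 8) | b1;
}
```

Because the key now covers two bytes, a lookup in a single-byte locale must still compare the first byte of the stored sequence, which is the small slowdown mentioned below.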
>> The patch vastly improves the performance of languages with multibyte characters (see
>> zh_CN, hi_IN and ja_JP below). A side effect is that some languages with single-byte
>> characters get a bit slower because of the extra check for the first byte while finding
>> the right sequence in the sequence list. It cannot be avoided since the hash key is no
>> longer equal to the first byte of the sequence. Tests are ok.
>
> Can we use UTF-8-specific knowledge to accelerate the lookup?
>
> For example, you know that E0 is always the start of a 3-byte UTF-8 sequence.
>
> Could you do two checks?
>
> (a) Are we in a 1-, 2-, 3-, 4-, 5-, or 6-byte sequence?
> (b) If in a 1-byte sequence, use a one-byte table.
> (c) If in a 2-6 byte sequence, use the hash table?
I tried it, but the switches necessary to make that work in locale/weight.h findidx() slow down the lookup (e.g. by 5% for an
en_US.UTF-8 file listing). Because of the padding of the collation file it does not save space either, so I would not
take that path.
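For reference, the leading-byte classification Carlos refers to (0xE0 always starting a 3-byte sequence) looks roughly like this; a sketch of the general technique, not the actual findidx() code:

```c
/* Sketch of UTF-8 leading-byte classification: the high bits of the
   first byte encode the total sequence length.  Returns 0 for bytes
   that cannot start a sequence (continuation bytes 0x80-0xBF; the
   invalid start bytes 0xFE/0xFF are not distinguished here).  */
static int
utf8_sequence_length (unsigned char c)
{
  if (c < 0x80) return 1;   /* 0xxxxxxx: ASCII */
  if (c < 0xC0) return 0;   /* 10xxxxxx: continuation byte */
  if (c < 0xE0) return 2;   /* 110xxxxx */
  if (c < 0xF0) return 3;   /* 1110xxxx: e.g. 0xE0 */
  if (c < 0xF8) return 4;   /* 11110xxx */
  if (c < 0xFC) return 5;   /* 111110xx (legacy, pre-RFC 3629) */
  return 6;                 /* 1111110x (legacy) */
}
```

The branch itself is cheap; the cost comes from adding it to every findidx() call, which is what produced the ~5% regression on single-byte input.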
>>> Will this always work? I'm just wondering about a user-generated charmap that they
>>> call 'utf8', which is the other common alias, used for instance where the dash is not
>>> valid syntax. Probably not, since the official name is UTF-8, and that's what you should use.
>>
>> Well, if it does not work it's just a speed penalty. But there is no problem in adding a check for "utf8".
>
> Could you check to see what value 'code_set_name' uses internally?
>
> Is it always 'UTF-8'? If it is, then the check you have is just fine.
>
> Otherwise we should check for utf8 and UTF-8.
>
> We don't know until you verify the values code_set_name can have.
The source of code_set_name is the charmap file (localedata/charmaps/UTF-8), and there it is defined as UTF-8.
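So a comparison against that single spelling suffices for the stock charmap; if we wanted to be defensive about user-built charmaps anyway, a hypothetical helper accepting both spellings would be:

```c
#include <string.h>

/* Hypothetical helper, not the patch's actual check: recognize the
   codeset as UTF-8.  code_set_name comes from the charmap file, which
   for the stock charmap (localedata/charmaps/UTF-8) always yields
   "UTF-8"; the "utf8" spelling could only appear in a user-generated
   charmap.  */
static int
is_utf8_codeset (const char *code_set_name)
{
  return strcmp (code_set_name, "UTF-8") == 0
         || strcmp (code_set_name, "utf8") == 0;
}
```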
>> diff --git a/locale/C-collate.c b/locale/C-collate.c
>> index 8214ff5..5a9ed6a 100644
>> --- a/locale/C-collate.c
>> +++ b/locale/C-collate.c
>> @@ -144,6 +144,8 @@ const struct __locale_data _nl_C_LC_COLLATE attribute_hidden =
>> /* _NL_COLLATE_COLLSEQWC */
>> { .string = (const char *) collseqwc },
>> /* _NL_COLLATE_CODESET */
>> - { .string = _nl_C_codeset }
>> + { .string = _nl_C_codeset },
>> + /* _NL_COLLATE_ENCODING_TYPE */
>> + { .word = __cet_8bit }
>
> This makes locale-archive incompatible again right?
>
> Users have to regenerate the locale-archive after the upgrade?
>
> We need a release note for that under "Packaging Changes"
> https://sourceware.org/glibc/wiki/Release/2.24#Packaging_Changes
>
> The release note should mention the binary locale-archive format
> has changed and that locale-archive must be removed before upgrading
> and then recompiled after upgrade.
Yes. Who should do it?
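For the packaging note, the regeneration step would amount to something like the following (a sketch; paths follow the usual glibc install layout, and distributions would script this in their upgrade hooks):

```shell
# Remove the stale binary archive, then rebuild each installed locale
# from its source definition and charmap.
rm -f /usr/lib/locale/locale-archive
localedef -i en_US -f UTF-8 en_US.UTF-8    # repeat per installed locale
```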
Best,
Leonhard
* Re: [PATCH V4][BZ #18441] fix sorting multibyte charsets with an improper locale
2016-05-23 7:37 [PATCH V4][BZ #18441] fix sorting multibyte charsets with an improper locale Leonhard Holz
@ 2016-06-17 5:39 ` Leonhard Holz
2016-06-27 17:22 ` [PING][PATCH " Leonhard Holz
` (2 subsequent siblings)
3 siblings, 0 replies; 6+ messages in thread
From: Leonhard Holz @ 2016-06-17 5:39 UTC (permalink / raw)
To: GNU C Library; +Cc: Carlos O'Donell
Any comments on this?
On 22.05.2016 at 17:52, Leonhard Holz wrote:
> [...]
* [PING][PATCH V4][BZ #18441] fix sorting multibyte charsets with an improper locale
2016-05-23 7:37 [PATCH V4][BZ #18441] fix sorting multibyte charsets with an improper locale Leonhard Holz
2016-06-17 5:39 ` Leonhard Holz
@ 2016-06-27 17:22 ` Leonhard Holz
2016-07-12 6:50 ` Leonhard Holz
2016-08-12 4:30 ` Leonhard Holz
3 siblings, 0 replies; 6+ messages in thread
From: Leonhard Holz @ 2016-06-27 17:22 UTC (permalink / raw)
To: GNU C Library; +Cc: Carlos O'Donell
On 22.05.2016 at 17:52, Leonhard Holz wrote:
> [...]
* [PING][PATCH V4][BZ #18441] fix sorting multibyte charsets with an improper locale
2016-05-23 7:37 [PATCH V4][BZ #18441] fix sorting multibyte charsets with an improper locale Leonhard Holz
2016-06-17 5:39 ` Leonhard Holz
2016-06-27 17:22 ` [PING][PATCH " Leonhard Holz
@ 2016-07-12 6:50 ` Leonhard Holz
2016-08-12 4:30 ` Leonhard Holz
3 siblings, 0 replies; 6+ messages in thread
From: Leonhard Holz @ 2016-07-12 6:50 UTC (permalink / raw)
To: GNU C Library; +Cc: Carlos O'Donell
On 22.05.2016 at 17:52, Leonhard Holz wrote:
> [...]
* [PING][PATCH V4][BZ #18441] fix sorting multibyte charsets with an improper locale
2016-05-23 7:37 [PATCH V4][BZ #18441] fix sorting multibyte charsets with an improper locale Leonhard Holz
` (2 preceding siblings ...)
2016-07-12 6:50 ` Leonhard Holz
@ 2016-08-12 4:30 ` Leonhard Holz
3 siblings, 0 replies; 6+ messages in thread
From: Leonhard Holz @ 2016-08-12 4:30 UTC (permalink / raw)
To: GNU C Library; +Cc: Carlos O'Donell
> [...]
* [PING][PATCH V4][BZ #18441] fix sorting multibyte charsets with an improper locale
@ 2016-03-20 21:13 Leonhard Holz
0 siblings, 0 replies; 6+ messages in thread
From: Leonhard Holz @ 2016-03-20 21:13 UTC (permalink / raw)
To: GNU C Library
Please review https://sourceware.org/ml/libc-alpha/2016-02/msg00874.html.