From: Leonhard Holz <leonhard.holz@web.de>
To: libc-alpha@sourceware.org
Subject: [PING^3][PATCH V3][BZ #18441] fix sorting multibyte charsets with an improper locale
Date: Thu, 30 Jul 2015 05:55:00 -0000
Message-ID: <55B9BC51.90603@web.de>
In-Reply-To: <55AD63EC.3030504@web.de>
Ping!
On 20.07.2015 at 23:11, Leonhard Holz wrote:
> Ping!
>
> On 13.07.2015 at 10:25, Leonhard Holz wrote:
>> Ping!
>>
>> On 06.07.2015 at 23:39, Leonhard Holz wrote:
>>> Patch v3: Replace _NL_CURRENT with _NL_CURRENT_WORD for reading the encoding.
>>> Patch v2: Use the UTF-8 to codepoint conversion proposed by Ondřej.
>>>
>>> In BZ #18441, sorting a Thai text with the en_US.UTF-8 locale causes a
>>> performance regression. The cause of the problem is that
>>>
>>> a) en_US.UTF-8 has no information for Thai chars and so always reports a zero
>>> sort weight, which causes the comparison to check the whole string instead of
>>> breaking off early, and
>>>
>>> b) the sequence-to-weight list is partitioned by the first byte of the first
>>> character (TABLEMB); this generates long lists for multibyte UTF-8 characters as
>>> they tend to have the same starting byte (e.g. all Thai chars start with E0).
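
To illustrate point b), here is a minimal sketch of what partitioning by
the first byte does to Thai text. The helper names and the sample table are
hypothetical and not taken from glibc; only the byte values are real UTF-8
encodings.

```c
#include <stddef.h>

/* Hypothetical sample data: UTF-8 encodings of four Thai characters. */
static const unsigned char thai_samples[][3] = {
  { 0xE0, 0xB8, 0x81 },   /* U+0E01 */
  { 0xE0, 0xB8, 0x82 },   /* U+0E02 */
  { 0xE0, 0xB8, 0xAE },   /* U+0E2E */
  { 0xE0, 0xB9, 0x80 },   /* U+0E40 */
};

/* Hypothetical helper: bucket index when the list is partitioned by the
   first byte of the sequence.  */
static unsigned int
first_byte_key (const unsigned char *seq)
{
  return seq[0];
}

/* Count how many of the N three-byte sequences land in BUCKET.  */
static size_t
count_in_bucket (const unsigned char seqs[][3], size_t n, unsigned int bucket)
{
  size_t count = 0;
  for (size_t i = 0; i < n; ++i)
    if (first_byte_key (seqs[i]) == bucket)
      ++count;
  return count;
}
```

Calling count_in_bucket (thai_samples, 4, 0xE0) counts all four samples,
so every one of them would end up in the same list.
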
>>>
>>> The approach of the patch is to interpret TABLEMB as a hash table and find a
>>> better hash key. My first try was to somehow "fold" a multibyte character into one
>>> byte, but that worsened the overall performance a lot. Enlarging the table to
>>> 2-byte keys works much better while needing a reasonable amount of extra memory.
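
One way such a wider key could be derived is sketched below. This is an
assumption for illustration, not the code of the patch: it decodes the
leading UTF-8 character to a code point and keeps the low 14 bits, which
matches the 16384-entry table size named in the ChangeLog. The names
decode_utf8_lead and two_byte_key are hypothetical, and continuation bytes
are not validated.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed table size, taken from the ChangeLog entry (16384 entries).  */
#define SKETCH_TABLE_SIZE 16384u

/* Hypothetical decoder: code point of the first UTF-8 character in S.
   Falls back to the first byte for malformed or truncated input.  */
static uint32_t
decode_utf8_lead (const unsigned char *s, size_t len)
{
  if (len == 0)
    return 0;
  unsigned char b0 = s[0];
  if (b0 < 0x80)
    return b0;
  if ((b0 & 0xE0) == 0xC0 && len >= 2)
    return ((uint32_t) (b0 & 0x1F) << 6) | (uint32_t) (s[1] & 0x3F);
  if ((b0 & 0xF0) == 0xE0 && len >= 3)
    return ((uint32_t) (b0 & 0x0F) << 12)
           | ((uint32_t) (s[1] & 0x3F) << 6)
           | (uint32_t) (s[2] & 0x3F);
  if ((b0 & 0xF8) == 0xF0 && len >= 4)
    return ((uint32_t) (b0 & 0x07) << 18)
           | ((uint32_t) (s[1] & 0x3F) << 12)
           | ((uint32_t) (s[2] & 0x3F) << 6)
           | (uint32_t) (s[3] & 0x3F);
  return b0;
}

/* Hypothetical hash key: low 14 bits of the code point.  */
static unsigned int
two_byte_key (const unsigned char *s, size_t len)
{
  return decode_utf8_lead (s, len) & (SKETCH_TABLE_SIZE - 1);
}
```

With this key, U+0E01 (E0 B8 81) and U+0E02 (E0 B8 82) map to different
buckets even though both start with E0.
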
>>>
>>> The patch vastly improves the performance of languages with multibyte chars (see
>>> zh_CN, hi_IN and ja_JP below). A side effect is that some languages with one-byte
>>> chars get a bit slower because of the extra check for the first byte while finding
>>> the right sequence in the sequence list. It cannot be avoided since the hash key is
>>> no longer equal to the first byte of the sequence. Tests are ok.
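
The extra first-byte check can be pictured with the sketch below. It is an
illustration under assumptions, not the layout used by glibc: struct
seq_entry and find_in_bucket are hypothetical, and the collision example
assumes a key that keeps only the low 14 bits of the code point. Under that
assumption U+0E01 (E0 B8 81) and U+4E01 (E4 B8 81) share a bucket and
differ only in their first byte, so the comparison has to start at byte 0.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical bucket entry: the full byte sequence, including its first
   byte, plus an index standing in for the weight data.  */
struct seq_entry
{
  const unsigned char *bytes;
  size_t len;
  int weight_index;
};

/* Return the weight index of the first entry whose bytes are a prefix of
   INPUT, or -1 if no entry in the bucket matches.  The comparison covers
   the first byte because the bucket key no longer implies it.  */
static int
find_in_bucket (const struct seq_entry *bucket, size_t n,
                const unsigned char *input, size_t input_len)
{
  for (size_t i = 0; i < n; ++i)
    if (bucket[i].len <= input_len
        && memcmp (bucket[i].bytes, input, bucket[i].len) == 0)
      return bucket[i].weight_index;
  return -1;
}
```

When the key was the first byte itself, a lookup could skip that byte; here
it costs one more byte per comparison, which fits the small slowdown
reported for some one-byte locales.
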
>>>
>>> filelist#C                   1.73%
>>> filelist#en_US.UTF-8         0.54%
>>> lorem_ipsum#vi_VN.UTF-8      1.90%
>>> lorem_ipsum#ar_SA.UTF-8    -12.06%
>>> lorem_ipsum#en_US.UTF-8      1.15%
>>> lorem_ipsum#zh_CN.UTF-8    -86.32%
>>> lorem_ipsum#cs_CZ.UTF-8    -11.42%
>>> lorem_ipsum#en_GB.UTF-8     -3.09%
>>> lorem_ipsum#da_DK.UTF-8      6.70%
>>> lorem_ipsum#pl_PL.UTF-8     -1.04%
>>> lorem_ipsum#fr_FR.UTF-8     -1.22%
>>> lorem_ipsum#pt_PT.UTF-8      0.47%
>>> lorem_ipsum#el_GR.UTF-8    -29.40%
>>> lorem_ipsum#ru_RU.UTF-8    -11.79%
>>> lorem_ipsum#iw_IL.UTF-8     -1.39%
>>> lorem_ipsum#es_ES.UTF-8      3.91%
>>> lorem_ipsum#hi_IN.UTF-8    -98.26%
>>> lorem_ipsum#sv_SE.UTF-8      5.61%
>>> lorem_ipsum#hu_HU.UTF-8     15.32%
>>> lorem_ipsum#tr_TR.UTF-8     -3.51%
>>> lorem_ipsum#is_IS.UTF-8      5.62%
>>> lorem_ipsum#it_IT.UTF-8     -5.97%
>>> lorem_ipsum#sr_RS.UTF-8     -1.19%
>>> lorem_ipsum#ja_JP.UTF-8    -98.11%
>>> wikipedia-th#en_US.UTF-8   -99.63%
>>>
>>>
>>> * locale/programs/ld-collate.c (struct locale_collate_t):
>>> Expand mbheads array from 256 to 16384 entries.
>>> (collate_finish): Generate 2-byte key for mbheads if UTF-8 locale.
>>> (collate_output): Output larger table and sequences including first byte.
>>> * locale/weight.h (findidx): Use 2-byte key for table if UTF-8 locale.
>>> * locale/weightwc.h (findidx): Accept encoding parameter, not used.
>>> * posix/fnmatch_loop.c (FCT): Call findidx with encoding parameter.
>>> * posix/regcomp.c (build_equiv_class): Likewise.
>>> * posix/regex_internal.h (re_string_elem_size_at): Likewise.
>>> * posix/regexec.c (check_node_accept_bytes): Likewise.
>>> * string/strcoll_l.c (get_next_seq): Likewise.
>>> (STRCOLL): Call get_next_seq with encoding parameter.
>>> * string/strxfrm_l.c (find_idx): Call findidx with encoding parameter.
>>> (STRXFRM): Call find_idx with encoding parameter.
>>>
Thread overview: 6+ messages
2015-07-06 21:39 [PATCH " Leonhard Holz
2015-07-13 8:26 ` [PING][PATCH " Leonhard Holz
2015-07-20 21:11 ` [PING^2][PATCH " Leonhard Holz
2015-07-30 5:55 ` Leonhard Holz [this message]
2015-07-31 3:58 ` [PING^3][PATCH " Carlos O'Donell
2015-10-07 16:13 ` [PATCH " Carlos O'Donell