From mboxrd@z Thu Jan 1 00:00:00 1970
To: mingw-w64-public@lists.sourceforge.net, libstdc++@gcc.gnu.org
Cc: Hannes Domani, Jonathan Wakely
References: <778019458.8796650.1611425106252@mail.yahoo.com>
From: Liu Hao
Subject: Re: [Mingw-w64-public] std::regex freezes in Japanese locale
Date: Sun, 24 Jan 2021 16:10:24 +0800
In-Reply-To: <778019458.8796650.1611425106252@mail.yahoo.com>
List-Id: Libstdc++ mailing list

On 2021-01-24 02:05, Hannes Domani via Mingw-w64-public wrote:
> On Saturday, 23 January 2021, 16:46:18 CET, Jeroen Ooms wrote:
>
>> A user of the R programming language has reported that std::regex
>> causes a hang for certain regular expressions when running in Japanese
>> locale. I was able to reproduce this both with our production
>> toolchain (mingw-w64 v5 + gcc 8) as well as the latest msys2
>> toolchains.
>>
>> Is this a bug in mingw-w64 or elsewhere? Below a minimal example:
>>
>> #include <regex>
>> int main() {
>>    setlocale(LC_ALL, "Japanese");
>>    std::regex reg("[0-9]");
>>    return 0;
>> }
>
> I can reproduce this as well, it took 108 seconds to finish here.
>
> Deep in regex is this function:
> std::__detail::_BracketMatcher, false, false>::_M_make_cache(std::integral_constant)
>
> This caches transformed values of the unicode values 0-255 to the current
> locale, with strxfrm_l [1].
> This fails for a lot of them for Japanese, and as documented, strxfrm_l
> returns INT_MAX in this case.
> But std::collate::do_transform does not handle any error case, it uses all
> return values as the length of the transformed string.
> And then it creates a copy of this 2GB string, which takes a lot of time,
> around 1s for each failing character.
>
> I think this should be reported to gcc (libstdc++).
>
>
> [1] https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/strxfrm-wcsxfrm-strxfrm-l-wcsxfrm-l?view=msvc-160
>
>

Adding libstdc++ and jwakely to CC.

Despite what the Microsoft docs say, the standard `_?(str|wcs)xfrm(_l)?` functions have no return
value that indicates an error. This issue seems to be caused by invalid multibyte character
sequences (MBCS) being passed to `_strxfrm_l`, which should be avoided.


-- 
Best regards,
LH_Mouse
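
For illustration, the failure mode described in the quoted analysis could be guarded against roughly as follows. This is only a sketch, not the libstdc++ code: `checked_transform` is a hypothetical helper, and treating a result of `INT_MAX` or more as an error is an assumption taken from the Microsoft documentation cited as [1]. The C standard itself defines no error return for `strxfrm`.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical helper (not part of libstdc++): transform `s` with
// std::strxfrm, but refuse to allocate when the reported length looks
// like the error value documented for the Microsoft CRT.
bool checked_transform(const char* s, std::string& out)
{
    assert(s != nullptr);

    // With a size of zero the destination may be null; the call then
    // only reports the length needed for the transformed string.
    const std::size_t len = std::strxfrm(nullptr, s, 0);

    // Assumption: INT_MAX (or anything larger) signals failure. Using it
    // as a length is what leads to the ~2GB string copy described above.
    if (len >= static_cast<std::size_t>(INT_MAX))
        return false;

    std::string buf(len + 1, '\0');         // one extra byte for the terminator
    std::strxfrm(&buf[0], s, buf.size());
    buf.resize(len);                        // drop the terminator again
    out.swap(buf);
    return true;
}
```

A caller would skip, or handle separately, any input for which the helper returns `false`, rather than using the returned length directly. This does not address the other point raised above, namely that invalid multibyte sequences should not be passed to the function in the first place.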