From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <libc-alpha-return-90547-listarch-libc-alpha=sources.redhat.com@sourceware.org>
Received: (qmail 50933 invoked by alias); 24 Feb 2018 06:16:27 -0000
Mailing-List: contact libc-alpha-help@sourceware.org; run by ezmlm
Precedence: bulk
List-Id: <libc-alpha.sourceware.org>
List-Subscribe: <mailto:libc-alpha-subscribe@sourceware.org>
List-Archive: <http://sourceware.org/ml/libc-alpha/>
List-Post: <mailto:libc-alpha@sourceware.org>
List-Help: <mailto:libc-alpha-help@sourceware.org>, <http://sourceware.org/ml/#faqs>
Sender: libc-alpha-owner@sourceware.org
Received: (qmail 50917 invoked by uid 89); 24 Feb 2018 06:16:26 -0000
Subject: Re: [Patch v3 11/14] [BZ #14095] update collation data from Unicode /
 ISO 14651
To: Mike FABIAN <mfabian@redhat.com>, libc-alpha@sourceware.org
Cc: "Dmitry V. Levin" <ldv@altlinux.org>
References: <s9da7w0c8bc.fsf@taka.site>
From: Carlos O'Donell <carlos@redhat.com>
Message-ID: <41c1f7dc-f603-6dd3-895f-7f755865e4d3@redhat.com>
Date: Sat, 24 Feb 2018 06:21:00 -0000
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101
 Thunderbird/52.6.0
MIME-Version: 1.0
In-Reply-To: <s9da7w0c8bc.fsf@taka.site>
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
X-SW-Source: 2018-02/txt/msg00693.txt.bz2

On 02/23/2018 02:24 AM, Mike FABIAN wrote:
> From 5c65168e569ba0c59ad43bbd88f37cdb356c16b6 Mon Sep 17 00:00:00 2001
> From: Mike FABIAN <mfabian@redhat.com>
> Date: Tue, 23 Jan 2018 17:29:36 +0100
> Subject: [PATCH 11/14] Fix test cases tst-fnmatch and tst-regexloc for the new
>  iso14651_t1_common file.
> MIME-Version: 1.0
> Content-Type: text/plain; charset=UTF-8
> Content-Transfer-Encoding: 8bit

OK with the following changes:

- Add a comment to tst-fnmatch.input explaining why ranges like [a-z]
  are tricky.
- Rework the test input so it keeps testing the range.

See comments below.

Reviewed-by: Carlos O'Donell <carlos@redhat.com>

> See:
> 
> http://pubs.opengroup.org/onlinepubs/7908799/xbd/re.html
> 
>> A range expression represents the set of collating elements that fall
>> between two elements in the current collation sequence,
>> inclusively. It is expressed as the starting point and the ending
>> point separated by a hyphen (-).
>>
>> Range expressions must not be used in portable applications because
>> their behaviour is dependent on the collating sequence. Ranges will be
>> treated according to the current collating sequence, and include such
>> characters that fall within the range based on that collating
>> sequence, regardless of character values. This, however, means that
>> the interpretation will differ depending on collating sequence. If,
>> for instance, one collating sequence defines ä as a variant of a,
>> while another defines it as a letter following z, then the expression
>> [ä-z] is valid in the first language and invalid in the second.
> Therefore, using [a-z] does not make much sense except in the C/POSIX locale.
> The new iso14651_t1_common lists upper case and lower case Latin characters
> in a different order than the old one which causes surprising results
> for example in the de_DE locale: [a-z] now includes A because A comes
> after a in iso14651_t1_common but does not include Z because that comes
> after z in iso14651_t1_common.

Why delete the tests, though? Why not adjust them to cover the new result?
The old tests were adjusted to the locale in the same way: they expect 'ä'
to fall within the range [a-z]. Could we adjust the new expectations similarly?
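
For anyone who wants to check the new expectations by hand, here is a
minimal sketch of a probe. The helper name matches_in_locale is made up
for illustration and is not part of the glibc test suite; it only uses
setlocale and fnmatch, and it reports -1 when the requested locale is
not installed.

```c
#include <assert.h>
#include <fnmatch.h>
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper, not part of the glibc test suite.
   Returns 1 if STR matches PATTERN under LOCALE_NAME, 0 if it does
   not, and -1 if the locale is not installed or fnmatch fails.  */
int
matches_in_locale (const char *locale_name, const char *pattern,
                   const char *str)
{
  int rc;

  assert (locale_name != NULL && pattern != NULL && str != NULL);

  /* Range expressions are resolved against the collation order of
     the locale selected here.  */
  if (setlocale (LC_ALL, locale_name) == NULL)
    return -1;

  rc = fnmatch (pattern, str, 0);

  /* Restore the default so later calls start from a known state.  */
  setlocale (LC_ALL, "C");

  if (rc == 0)
    return 1;
  return rc == FNM_NOMATCH ? 0 : -1;
}
```

Going by the commit message above, matches_in_locale ("de_DE.UTF-8",
"[a-z]", "A") is the call whose result should flip with the new
iso14651_t1_common, assuming that locale is installed. In the C locale
the same pattern matches "a" and "z" but not "A" or "Z".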

> 
> 	* posix/tst-fnmatch.input: Use range expressions only in C locale.
> 	* posix/tst-regexloc.c: Do not use a range expression for
>         de_DE.ISO-8859-1 locale.
> ---
>  posix/tst-fnmatch.input | 40 ----------------------------------------
>  posix/tst-regexloc.c    |  4 ++--
>  2 files changed, 2 insertions(+), 42 deletions(-)
> 
> diff --git a/posix/tst-fnmatch.input b/posix/tst-fnmatch.input
> index 88b3f739a5..1e2f62c0ed 100644
> --- a/posix/tst-fnmatch.input
> +++ b/posix/tst-fnmatch.input
> @@ -418,26 +418,6 @@ C		"-"			"[Z-\\]]"	       NOMATCH
>  # Following are tests outside the scope of IEEE 2003.2 since they are using
>  # locales other than the C locale.  The main focus of the tests is on the
>  # handling of ranges and the recognition of character (vs bytes).

Here we need a comment explaining exactly why [a-z] is tricky. Basically include
the text you wrote for the commit message here :-)

> -de_DE.ISO-8859-1 "a"			"[a-z]"		       0
> -de_DE.ISO-8859-1 "z"			"[a-z]"		       0
> -de_DE.ISO-8859-1 "ä"			"[a-z]"		       0
> -de_DE.ISO-8859-1 "ö"			"[a-z]"		       0
> -de_DE.ISO-8859-1 "ü"			"[a-z]"		       0
> -de_DE.ISO-8859-1 "A"			"[a-z]"		       NOMATCH

This becomes 0.

> -de_DE.ISO-8859-1 "Z"			"[a-z]"		       NOMATCH

Stays the same.

> -de_DE.ISO-8859-1 "Ä"			"[a-z]"		       NOMATCH
> -de_DE.ISO-8859-1 "Ö"			"[a-z]"		       NOMATCH
> -de_DE.ISO-8859-1 "Ü"			"[a-z]"		       NOMATCH

All become 0.

etc.

> -de_DE.ISO-8859-1 "a"			"[A-Z]"		       NOMATCH
> -de_DE.ISO-8859-1 "z"			"[A-Z]"		       NOMATCH
> -de_DE.ISO-8859-1 "ä"			"[A-Z]"		       NOMATCH
> -de_DE.ISO-8859-1 "ö"			"[A-Z]"		       NOMATCH
> -de_DE.ISO-8859-1 "ü"			"[A-Z]"		       NOMATCH
> -de_DE.ISO-8859-1 "A"			"[A-Z]"		       0
> -de_DE.ISO-8859-1 "Z"			"[A-Z]"		       0
> -de_DE.ISO-8859-1 "Ä"			"[A-Z]"		       0
> -de_DE.ISO-8859-1 "Ö"			"[A-Z]"		       0
> -de_DE.ISO-8859-1 "Ü"			"[A-Z]"		       0
>  de_DE.ISO-8859-1 "a"			"[[:lower:]]"	       0
>  de_DE.ISO-8859-1 "z"			"[[:lower:]]"	       0
>  de_DE.ISO-8859-1 "ä"			"[[:lower:]]"	       0
> @@ -510,26 +490,6 @@ de_DE.ISO-8859-1 "ba"			"[[.a.]]a"	       NOMATCH
>  
>  
>  # And with a multibyte character set.
> -de_DE.UTF-8	 "a"			"[a-z]"		       0
> -de_DE.UTF-8	 "z"			"[a-z]"		       0
> -de_DE.UTF-8	 "ä"			"[a-z]"		       0
> -de_DE.UTF-8	 "ö"			"[a-z]"		       0
> -de_DE.UTF-8	 "ü"			"[a-z]"		       0
> -de_DE.UTF-8	 "A"			"[a-z]"		       NOMATCH
> -de_DE.UTF-8	 "Z"			"[a-z]"		       NOMATCH
> -de_DE.UTF-8	 "Ä"			"[a-z]"		       NOMATCH
> -de_DE.UTF-8	 "Ö"			"[a-z]"		       NOMATCH
> -de_DE.UTF-8	 "Ü"			"[a-z]"		       NOMATCH
> -de_DE.UTF-8	 "a"			"[A-Z]"		       NOMATCH
> -de_DE.UTF-8	 "z"			"[A-Z]"		       NOMATCH
> -de_DE.UTF-8	 "ä"			"[A-Z]"		       NOMATCH
> -de_DE.UTF-8	 "ö"			"[A-Z]"		       NOMATCH
> -de_DE.UTF-8	 "ü"			"[A-Z]"		       NOMATCH
> -de_DE.UTF-8	 "A"			"[A-Z]"		       0
> -de_DE.UTF-8	 "Z"			"[A-Z]"		       0
> -de_DE.UTF-8	 "Ä"			"[A-Z]"		       0
> -de_DE.UTF-8	 "Ö"			"[A-Z]"		       0
> -de_DE.UTF-8	 "Ü"			"[A-Z]"		       0
>  de_DE.UTF-8	 "a"			"[[:lower:]]"	       0
>  de_DE.UTF-8	 "z"			"[[:lower:]]"	       0
>  de_DE.UTF-8	 "ä"			"[[:lower:]]"	       0
> diff --git a/posix/tst-regexloc.c b/posix/tst-regexloc.c
> index 60235b4d3b..7fbc496d0c 100644
> --- a/posix/tst-regexloc.c
> +++ b/posix/tst-regexloc.c
> @@ -29,8 +29,8 @@ do_test (void)
>  
>    if (setlocale (LC_ALL, "de_DE.ISO-8859-1") == NULL)
>      puts ("cannot set locale");
> -  else if (regcomp (&re, "[a-f]*", 0) != REG_NOERROR)
> -    puts ("cannot compile expression \"[a-f]*\"");
> +  else if (regcomp (&re, "[abcdef]*", 0) != REG_NOERROR)
> +    puts ("cannot compile expression \"[abcdef]*\"");

OK.
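
For reference, a minimal sketch of why the enumerated set is the safer
choice here. The helper first_match_length is hypothetical and is not
taken from tst-regexloc.c; it compiles a POSIX basic regular expression
in whatever locale is currently active and returns the length of the
first match. An explicit list like [abcdef] does not depend on the
collation order, whereas [a-f] may pick up uppercase letters under the
new collation data.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper, not taken from tst-regexloc.c.
   Compiles PATTERN as a POSIX basic regular expression in the current
   locale and returns the length of the first match in SUBJECT, or -1
   if the pattern does not compile or nothing matches.  */
long
first_match_length (const char *pattern, const char *subject)
{
  regex_t re;
  regmatch_t mat[1];
  long len = -1;

  assert (pattern != NULL && subject != NULL);

  if (regcomp (&re, pattern, 0) != 0)
    return -1;

  /* rm_so and rm_eo delimit the matched substring of SUBJECT.  */
  if (regexec (&re, subject, 1, mat, 0) == 0)
    len = (long) (mat[0].rm_eo - mat[0].rm_so);

  regfree (&re);
  return len;
}
```

With "[abcdef]*" and the subject "abcdefCDEF" the match covers only the
six lowercase letters, whichever locale is active. Calling the same
helper with "[a-f]*" after setlocale (LC_ALL, "de_DE.ISO-8859-1") is a
quick way to see how far the range reaches on a given system.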

>    else if (regexec (&re, "abcdefCDEF", 1, mat, 0) == REG_NOMATCH)
>      puts ("no match");
>    else
> -- 2.14.3

-- 
Cheers,
Carlos.