From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <vapier@gentoo.org>
Received: from smtp.gentoo.org (smtp.gentoo.org [140.211.166.183])
 by sourceware.org (Postfix) with ESMTP id 3290F3851C18
 for <libc-alpha@sourceware.org>; Wed, 14 Apr 2021 20:28:13 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.3.2 sourceware.org 3290F3851C18
Received: from vapier (localhost [127.0.0.1])
 (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits))
 (No client certificate requested)
 by smtp.gentoo.org (Postfix) with ESMTPS id 21E3C340DBB;
 Wed, 14 Apr 2021 20:28:12 +0000 (UTC)
Date: Wed, 14 Apr 2021 16:28:11 -0400
From: Mike Frysinger <vapier@gentoo.org>
To: Paul Eggert <eggert@cs.ucla.edu>
Cc: Joseph Myers <joseph@codesourcery.com>,
 GNU C Library <libc-alpha@sourceware.org>
Subject: Re: UTF-8 in glibc commit messages
Message-ID: <YHdQW5deN3vge2bu@vapier>
Mail-Followup-To: Paul Eggert <eggert@cs.ucla.edu>,
 Joseph Myers <joseph@codesourcery.com>,
 GNU C Library <libc-alpha@sourceware.org>
References: <a5c371a5-48ea-1eb4-719a-0861d6797938@cs.ucla.edu>
 <alpine.DEB.2.22.394.2104132016550.30343@digraph.polyomino.org.uk>
 <5e9119c5-afe3-d8f1-a732-8aa2f4955c3b@cs.ucla.edu>
 <YHcDsTB/pSUnb2l0@vapier>
 <4878f4cd-529d-fa8a-6394-d7ae6a69c824@cs.ucla.edu>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
In-Reply-To: <4878f4cd-529d-fa8a-6394-d7ae6a69c824@cs.ucla.edu>
X-Spam-Status: No, score=-4.5 required=5.0 tests=BAYES_00, KAM_DMARC_STATUS,
 RCVD_IN_MSPIKE_H3, RCVD_IN_MSPIKE_WL, SPF_HELO_PASS, SPF_PASS,
 TXREP autolearn=ham autolearn_force=no version=3.4.2
X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on
 server2.sourceware.org
X-BeenThere: libc-alpha@sourceware.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Libc-alpha mailing list <libc-alpha.sourceware.org>
List-Unsubscribe: <https://sourceware.org/mailman/options/libc-alpha>,
 <mailto:libc-alpha-request@sourceware.org?subject=unsubscribe>
List-Archive: <https://sourceware.org/pipermail/libc-alpha/>
List-Post: <mailto:libc-alpha@sourceware.org>
List-Help: <mailto:libc-alpha-request@sourceware.org?subject=help>
List-Subscribe: <https://sourceware.org/mailman/listinfo/libc-alpha>,
 <mailto:libc-alpha-request@sourceware.org?subject=subscribe>
X-List-Received-Date: Wed, 14 Apr 2021 20:28:14 -0000

On 14 Apr 2021 11:08, Paul Eggert wrote:
> On 4/14/21 8:01 AM, Mike Frysinger wrote:
> > can't we be proactive ?  let's go all-in on UTF-8.
> 
> A problem with "all-in" is that UTF-8 has weird characters that can mess 
> things up. The commit message check was originally put in because 
> someone copy-pasted U+2069 POP DIRECTIONAL ISOLATE into a commit message 
> without realizing it. That invisible character breaks simple searches 
> like 'grep -w'.

arguably that's a missing feature in grep: the user has no way to express
word searches that match on graphemes.  but we'd prob get sidetracked into
the weeds with that discussion.

> glibc's current check isn't quite right either, as it allows lines like 
> this:
> 
>      Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
> 
> in which each "space" is actually U+00A0 NO-BREAK SPACE. Although that's 
> valid ISO-8859-15, U+00A0 is another weird character that we arguably 
> shouldn't allow as it can also mess up searches (it's even blacklisted 
> in URLs by some browsers because of the potential for phishing).
> 
> It'd be better to come up with an exact list of acceptable Unicode 
> characters (probably a set of categories with some exceptions). This 
> would be better than the current approach which is either too-generous 
> or (mostly) too-restrictive. But it'd be some work.

it seems like the concern is over things being accidentally copied & pasted
(or generated) into a commit message, not over them being used at all.  if
we tried to ban e.g. all combining characters, that'd implicitly ban use of
many scripts in the commit message, even when they're used intentionally.
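
to make the combining-character point concrete, here's a small python sketch
(not anything glibc uses; has_combining is a made-up helper name) that flags
any character whose Unicode general category is a mark (Mn, Mc, Me).  a
blanket ban on that class would reject ordinary Devanagari text and
decomposed accents, while a precomposed accent passes:

```python
import unicodedata


def has_combining(text):
    """Return True if any character is a combining mark (category Mn, Mc or Me)."""
    return any(unicodedata.category(ch).startswith("M") for ch in text)


# Hypothetical sample strings, chosen only for illustration.
samples = {
    "ascii": "fix strtol overflow",
    # Devanagari word: contains vowel signs (Mc) and a virama (Mn).
    "devanagari": "\u0939\u093f\u0928\u094d\u0926\u0940",
    # "e" followed by U+0301 COMBINING ACUTE ACCENT (Mn).
    "decomposed": "cafe\u0301",
    # U+00E9 is a single precomposed letter (Ll), so no mark is present.
    "precomposed": "caf\u00e9",
}

for name, text in samples.items():
    verdict = "rejected" if has_combining(text) else "accepted"
    print("%s: %s" % (name, verdict))
```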

to that end, rather than try to come up with a sloppy policy that requires
constant care & feeding, why not go with a hook that devs can override ?
(i'm sure there's prior art out there we could reuse.)  so with the commit
above where U+00A0 was used by accident, the push would be rejected with
something like:
remote: Non-ASCII character found in commit message:
remote: line 1234: Reviewed-by:\u00A0Adhemerval Zanella  <adhemerval.zanella@linaro.org>
remote:                        ^
remote: If this was not a mistake, add "-o bypass-commit-encoding-check" to bypass this check.

that should catch all the accidental usage while still allowing people to
double check and say "i meant to use these".
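
a rough sketch of the checking logic in python, under a few assumptions: the
option name is just the one from the mock-up above, check_message and
render are made-up helper names, and the caller is expected to pass in
whatever push options the server-side hook received (git exposes those to
hooks via the GIT_PUSH_OPTION_COUNT / GIT_PUSH_OPTION_<n> environment
variables when push options are enabled).  it is not an existing glibc hook:

```python
# Hypothetical option name, taken from the mock-up message above.
BYPASS_OPTION = "bypass-commit-encoding-check"


def render(line):
    """Escape non-ASCII characters in line.

    Returns (escaped_line, caret_offset); caret_offset is the column of the
    first escaped character in the escaped line, or None if the line is ASCII.
    """
    pieces = []
    caret = None
    width = 0
    for ch in line:
        code = ord(ch)
        if code > 0x7F:
            if caret is None:
                caret = width
            piece = "\\u%04X" % code if code <= 0xFFFF else "\\U%08X" % code
        else:
            piece = ch
        pieces.append(piece)
        width += len(piece)
    return "".join(pieces), caret


def check_message(message, push_options=()):
    """Return report lines for a commit message; an empty list means accept."""
    if BYPASS_OPTION in push_options:
        return []
    report = []
    # Split on "\n" only, so Unicode line separators stay visible to the check.
    for lineno, line in enumerate(message.split("\n"), start=1):
        escaped, caret = render(line)
        if caret is None:
            continue
        if not report:
            report.append("Non-ASCII character found in commit message:")
        prefix = "line %d: " % lineno
        report.append(prefix + escaped)
        report.append(" " * (len(prefix) + caret) + "^")
    if report:
        report.append(
            'If this was not a mistake, add "-o %s" to bypass this check.'
            % BYPASS_OPTION
        )
    return report
```

a real hook would print each returned line (git adds the "remote: " prefix
on the client side) and exit non-zero whenever the list is non-empty.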
-mike