UTF-8 in glibc commit messages

public inbox for libc-alpha@sourceware.org

* UTF-8 in glibc commit messages
@ 2021-04-13 19:21 Paul Eggert
  2021-04-13 20:19 ` Joseph Myers
  0 siblings, 1 reply; 9+ messages in thread
From: Paul Eggert @ 2021-04-13 19:21 UTC
  To: GNU C Library

Today I tried to install a glibc fix with some non-ASCII chars in the 
commit message, but sourceware.org rejected the fix with the diagnostic 
reproduced at the end of this email.

I'm puzzled by the message, as I see that existing commit messages use 
UTF-8, e.g., 41f013cef24884604c303435dd1915be2ea5c0e0's commit message 
contains the phrase "Developer’s Manual" with the same curly apostrophe 
that my proposed commit used. Has something changed recently on 
sourceware.org to reject such commits?

I assume that UTF-8 is fine in commit messages as we use it to spell 
contributors' names. I worked around this particular problem by 
ASCIIfying the quotes but I'd rather spell people's names nicely in any 
future patches. (It might make sense for the commit hook to reject some 
troublesome Unicode characters but that's a somewhat different topic.)

Here's the diagnostic I got:

$ git push
Enumerating objects: 21, done.
Counting objects: 100% (21/21), done.
Delta compression using up to 4 threads
Compressing objects: 100% (11/11), done.
Writing objects: 100% (11/11), 3.79 KiB | 242.00 KiB/s, done.
Total 11 (delta 10), reused 0 (delta 0), pack-reused 0
remote: *** Invalid revision history for commit 371ef6b43fcb5052444c3f0e1ec09d01c5017149:
remote: *** It contains characters not in the ISO-8859-15 charset.
remote: ***
remote: *** Below is the first line where this was detected (line 16):
remote: *** | Use code font for ‘malloc’ instead of roman font.
remote: ***                     ^
remote: ***                     |
remote: ***
remote: *** Please amend the commit's revision history to remove it
remote: *** and try again.
remote: error: hook declined to update refs/heads/master
To sourceware.org:/git/glibc.git
  ! [remote rejected]       master -> master (hook declined)
error: failed to push some refs to 'sourceware.org:/git/glibc.git'
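
For illustration, the check behind this diagnostic can be approximated in a few lines of Python. This is only a sketch, assuming the hook tests each character of the commit message against ISO-8859-15 and reports the first failure; `first_unencodable` is a hypothetical name and not part of the actual hooks.

```python
def first_unencodable(message, charset="iso-8859-15"):
    """Return (line_number, column, character) for the first character
    that cannot be encoded in `charset`, or None if all of them can.
    Line and column numbers are 1-based."""
    for line_number, line in enumerate(message.split("\n"), start=1):
        for column, char in enumerate(line, start=1):
            try:
                char.encode(charset)
            except UnicodeEncodeError:
                return line_number, column, char
    return None


# A shortened, made-up commit message; only the last line is from the
# diagnostic above.  U+2018 and U+2019 are the curly quotes.
message = (
    "Fix manual markup\n"
    "\n"
    "Use code font for \u2018malloc\u2019 instead of roman font.\n"
)
print(first_unencodable(message))
```

The curly quote U+2018 has no ISO-8859-15 encoding, so it is reported, whereas an ASCII apostrophe would pass.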


* Re: UTF-8 in glibc commit messages
  2021-04-13 19:21 UTF-8 in glibc commit messages Paul Eggert
@ 2021-04-13 20:19 ` Joseph Myers
  2021-04-14  0:06   ` Paul Eggert
  0 siblings, 1 reply; 9+ messages in thread
From: Joseph Myers @ 2021-04-13 20:19 UTC
  To: Paul Eggert; +Cc: GNU C Library

On Tue, 13 Apr 2021, Paul Eggert wrote:

> Today I tried to install a glibc fix with some non-ASCII chars in the commit
> message, but sourceware.org rejected the fix with the diagnostic reproduced at
> the end of this email.
> 
> I'm puzzled by the message, as I see that existing commit messages use UTF-8,
> e.g., 41f013cef24884604c303435dd1915be2ea5c0e0's commit message contains the
> phrase "Developer’s Manual" with the same curly apostrophe that my proposed
> commit used. Has something changed recently on sourceware.org to reject such
> commits?

At some point the hooks were updated from AdaCore upstream, though that's 
a while ago now.  hooks.no-rh-character-range-check is the relevant 
setting in project.config (on ref refs/meta/config) to disable this check 
(that setting was added to the hooks because I didn't think this check 
would be appropriate for GCC).

-- 
Joseph S. Myers
joseph@codesourcery.com


* Re: UTF-8 in glibc commit messages
  2021-04-13 20:19 ` Joseph Myers
@ 2021-04-14  0:06   ` Paul Eggert
  2021-04-14 15:01     ` Mike Frysinger
  0 siblings, 1 reply; 9+ messages in thread
From: Paul Eggert @ 2021-04-14  0:06 UTC
  To: Joseph Myers; +Cc: GNU C Library

On 4/13/21 1:19 PM, Joseph Myers wrote:
> hooks.no-rh-character-range-check is the relevant
> setting in project.config (on ref refs/meta/config) to disable this check
> (that setting was added to the hooks because I didn't think this check
> would be appropriate for GCC).

Thanks for the info. If I understand things correctly, commit messages 
can spell contributors' names (using UTF-8), so long as the names can be 
re-encoded into ISO-8859-15. At some point this restriction may become 
more annoying than it currently is, but in the meantime I don't have an 
urgent need to relax the checking.
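
To make the restriction concrete, here is a small Python sketch that tests a few names appearing in this thread against Python's `iso-8859-15` codec. The helper `encodable` is hypothetical, and the sketch says nothing about which parts of a commit the hook actually inspects.

```python
def encodable(text, charset="iso-8859-15"):
    """Return True if every character of `text` can be encoded in `charset`."""
    try:
        text.encode(charset)
    except UnicodeEncodeError:
        return False
    return True


# Names taken from this thread; \u0159 is 'ř' and \u00ed is 'í'.
for name in ["Adhemerval Zanella", "Mao Han", "Ond\u0159ej B\u00edlka"]:
    print(name, encodable(name))
```

ISO-8859-15 covers Western European letters such as 'í' (U+00ED) but not 'ř' (U+0159), so a name like "Ondřej Bílka" cannot be re-encoded into it.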


* Re: UTF-8 in glibc commit messages
  2021-04-14  0:06   ` Paul Eggert
@ 2021-04-14 15:01     ` Mike Frysinger
  2021-04-14 17:41       ` DJ Delorie
  2021-04-14 18:08       ` Paul Eggert
  0 siblings, 2 replies; 9+ messages in thread
From: Mike Frysinger @ 2021-04-14 15:01 UTC
  To: Paul Eggert; +Cc: Joseph Myers, GNU C Library

On 13 Apr 2021 17:06, Paul Eggert wrote:
> On 4/13/21 1:19 PM, Joseph Myers wrote:
> > hooks.no-rh-character-range-check is the relevant
> > setting in project.config (on ref refs/meta/config) to disable this check
> > (that setting was added to the hooks because I didn't think this check
> > would be appropriate for GCC).
> 
> Thanks for the info. If I understand things correctly, commit messages 
> can spell contributors' names (using UTF-8), so long as the names can be 
> re-encoded into ISO-8859-15. At some point this restriction may become 
> more annoying than it currently is, but in the meantime I don't have an 
> urgent need to relax the checking.

can't we be proactive ?  let's go all-in on UTF-8.
-mike


* Re: UTF-8 in glibc commit messages
  2021-04-14 15:01     ` Mike Frysinger
@ 2021-04-14 17:41       ` DJ Delorie
  2021-04-14 18:08       ` Paul Eggert
  1 sibling, 0 replies; 9+ messages in thread
From: DJ Delorie @ 2021-04-14 17:41 UTC
  To: Mike Frysinger; +Cc: libc-alpha

Mike Frysinger via Libc-alpha <libc-alpha@sourceware.org> writes:
> can't we be proactive ?  let's go all-in on UTF-8.

+1.  Given the widespread use of utf-8 I think it's a wasted effort to
implement any halfway solutions any more.



* Re: UTF-8 in glibc commit messages
  2021-04-14 15:01     ` Mike Frysinger
  2021-04-14 17:41       ` DJ Delorie
@ 2021-04-14 18:08       ` Paul Eggert
  2021-04-14 18:16         ` Adhemerval Zanella
  2021-04-14 20:28         ` Mike Frysinger
  1 sibling, 2 replies; 9+ messages in thread
From: Paul Eggert @ 2021-04-14 18:08 UTC
  To: Joseph Myers, GNU C Library

On 4/14/21 8:01 AM, Mike Frysinger wrote:
> can't we be proactive ?  let's go all-in on UTF-8.

A problem with "all-in" is that UTF-8 has weird characters that can mess 
things up. The commit message check was originally put in because 
someone copy-pasted U+2069 POP DIRECTIONAL ISOLATE into a commit message 
without realizing it. That invisible character breaks simple searches 
like 'grep -w'.

glibc's current check isn't quite right either, as it allows lines like 
this:

     Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>

in which each "space" is actually U+00A0 NO-BREAK SPACE. Although that's 
valid ISO-8859-15, U+00A0 is another weird character that we arguably 
shouldn't allow as it can also mess up searches (it's even blacklisted 
in URLs by some browsers because of the potential for phishing).

It'd be better to come up with an exact list of acceptable Unicode 
characters (probably a set of categories with some exceptions). This 
would be better than the current approach which is either too-generous 
or (mostly) too-restrictive. But it'd be some work.
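
One way to read "a set of categories with some exceptions" is a check driven by Unicode general categories. The Python sketch below is an assumption about what such a policy might look like, not an agreed list: it flags control and format characters (Cc and Cf, the latter covering U+2069) and separators (Zs, Zl, Zp, which covers U+00A0), with ASCII space and tab as exceptions. Letters and combining marks are left alone. The names `REJECTED_CATEGORIES`, `ALLOWED_EXCEPTIONS` and `suspicious_characters` are hypothetical.

```python
import unicodedata

# Hypothetical policy: general categories to reject, plus exceptions.
REJECTED_CATEGORIES = {"Cc", "Cf", "Zs", "Zl", "Zp"}
ALLOWED_EXCEPTIONS = {" ", "\t"}


def suspicious_characters(message):
    """Return a list of (line_number, column, description) tuples, one for
    each character whose general category is rejected by the policy."""
    found = []
    for line_number, line in enumerate(message.split("\n"), start=1):
        for column, char in enumerate(line, start=1):
            if char in ALLOWED_EXCEPTIONS:
                continue
            if unicodedata.category(char) in REJECTED_CATEGORIES:
                name = unicodedata.name(char, "<unnamed>")
                found.append((line_number, column, f"U+{ord(char):04X} {name}"))
    return found


# A made-up message containing the two characters discussed above.
sample = "Reviewed-by:\u00a0A Reviewer\nUse malloc\u2069 here"
for hit in suspicious_characters(sample):
    print(hit)
```

Under this policy a name such as "Ondřej Bílka" passes, as does text using combining marks (category Mn), while the invisible characters are reported with their position.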


* Re: UTF-8 in glibc commit messages
  2021-04-14 18:08       ` Paul Eggert
@ 2021-04-14 18:16         ` Adhemerval Zanella
  2021-04-14 18:24           ` Paul Eggert
  2021-04-14 20:28         ` Mike Frysinger
  1 sibling, 1 reply; 9+ messages in thread
From: Adhemerval Zanella @ 2021-04-14 18:16 UTC
  To: Paul Eggert, Joseph Myers, GNU C Library



On 14/04/2021 15:08, Paul Eggert wrote:
> On 4/14/21 8:01 AM, Mike Frysinger wrote:
>> can't we be proactive ?  let's go all-in on UTF-8.
> 
> A problem with "all-in" is that UTF-8 has weird characters that can mess things up. The commit message check was originally put in because someone copy-pasted U+2069 POP DIRECTIONAL ISOLATE into a commit message without realizing it. That invisible character breaks simple searches like 'grep -w'.
> 
> glibc's current check isn't quite right either, as it allows lines like this:
> 
>     Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
> 
> in which each "space" is actually U+00A0 NO-BREAK SPACE. Although that's valid ISO-8859-15, U+00A0 is another weird character that we arguably shouldn't allow as it can also mess up searches (it's even blacklisted in URLs by some browsers because of the potential for phishing).

Was I the author?  If so I will check what I am doing wrong with my 
environment.

> 
> It'd be better to come up with an exact list of acceptable Unicode characters (probably a set of categories with some exceptions). This would be better than the current approach which is either too-generous or (mostly) too-restrictive. But it'd be some work.


* Re: UTF-8 in glibc commit messages
  2021-04-14 18:16         ` Adhemerval Zanella
@ 2021-04-14 18:24           ` Paul Eggert
  0 siblings, 0 replies; 9+ messages in thread
From: Paul Eggert @ 2021-04-14 18:24 UTC
  To: Adhemerval Zanella, Joseph Myers, GNU C Library

On 4/14/21 11:16 AM, Adhemerval Zanella wrote:
>> glibc's current check isn't quite right either, as it allows lines like this:
>>
>>      Reviewed-by: Adhemerval Zanella<adhemerval.zanella@linaro.org>
>>
>> in which each "space" is actually U+00A0 NO-BREAK SPACE.
> Was I the author?  If so I will check what I am doing wrong with my
> environment.

No, you were the reviewer. The committer was Mao Han. This was commit
56b223c1c8334e4255bf11aed1386a007822702a.

Here are the other glibc commits with U+00A0 issues, and the committers. 
The problem is reasonably rare (4 commits of 37,000) but there it is.

1cf4ae7fe644f5ad37ca82cb432147daf5c8ad77 Leonardo Sandoval
58007e9e68913290b1f4f73afc1055f779a8ed5d H.J. Lu
bd3675f9a3e91edf997c0515f0f1fce1669f038c Ondřej Bílka


* Re: UTF-8 in glibc commit messages
  2021-04-14 18:08       ` Paul Eggert
  2021-04-14 18:16         ` Adhemerval Zanella
@ 2021-04-14 20:28         ` Mike Frysinger
  1 sibling, 0 replies; 9+ messages in thread
From: Mike Frysinger @ 2021-04-14 20:28 UTC
  To: Paul Eggert; +Cc: Joseph Myers, GNU C Library

On 14 Apr 2021 11:08, Paul Eggert wrote:
> On 4/14/21 8:01 AM, Mike Frysinger wrote:
> > can't we be proactive ?  let's go all-in on UTF-8.
> 
> A problem with "all-in" is that UTF-8 has weird characters that can mess 
> things up. The commit message check was originally put in because 
> someone copy-pasted U+2069 POP DIRECTIONAL ISOLATE into a commit message 
> without realizing it. That invisible character breaks simple searches 
> like 'grep -w'.

arguably seems like a missing feature in grep that the user cannot express
word searches that match graphemes.  but we'll prob get sidetracked into
the weeds with that discussion.

> glibc's current check isn't quite right either, as it allows lines like 
> this:
> 
>      Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
> 
> in which each "space" is actually U+00A0 NO-BREAK SPACE. Although that's 
> valid ISO-8859-15, U+00A0 is another weird character that we arguably 
> shouldn't allow as it can also mess up searches (it's even blacklisted 
> in URLs by some browsers because of the potential for phishing).
> 
> It'd be better to come up with an exact list of acceptable Unicode 
> characters (probably a set of categories with some exceptions). This 
> would be better than the current approach which is either too-generous 
> or (mostly) too-restrictive. But it'd be some work.

it seems like the concern is over accidental things being copied & pasted
(or generated) in a commit message vs them never being used.  if we tried
to ban e.g. all combining characters, that'd implicitly ban use of many
scripts in the commit message, even if they were being used intentionally.

to that end, rather than try and come up with a sloppy policy that requires
constant care & feeding, why not go with a hook that devs can override ?
so with the commit above where U+00A0 was used by accident, the push would
be rejected with something like:
(i'm sure there's prior art out there we could reuse)
remote: Non-ASCII character found in commit message:
remote: line 1234: Reviewed-by:\u00A0Adhemerval Zanella  <adhemerval.zanella@linaro.org>
remote:                        ^
remote: If this was not a mistake, add "-o bypass-commit-encoding-check" to bypass this check.

that should catch all the accidental usage while still allowing people to
double check and say "i meant to use these".
-mike
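
A rough Python sketch of such an overridable check follows. It assumes the server-side hook can see push options through the `GIT_PUSH_OPTION_COUNT` and `GIT_PUSH_OPTION_<n>` environment variables that Git sets for receive hooks when push options are enabled on the server. The option name comes from the proposal above; `push_options` and `check_message` are hypothetical names, and the escaped form of the offending character is whatever Python's `backslashreplace` produces rather than the exact format shown above.

```python
import os

BYPASS_OPTION = "bypass-commit-encoding-check"  # name from the proposal above


def push_options(environ=None):
    """Collect the `git push -o` options that Git exposes to receive hooks."""
    environ = os.environ if environ is None else environ
    count = int(environ.get("GIT_PUSH_OPTION_COUNT", "0"))
    return [environ.get(f"GIT_PUSH_OPTION_{i}", "") for i in range(count)]


def escape(text):
    """Render non-ASCII characters as backslash escapes."""
    return text.encode("ascii", "backslashreplace").decode("ascii")


def check_message(message, options):
    """Return diagnostic lines for the first non-ASCII character found,
    or an empty list if the message is clean or the check is bypassed."""
    if BYPASS_OPTION in options:
        return []
    for line_number, line in enumerate(message.split("\n"), start=1):
        for index, char in enumerate(line):
            if ord(char) > 0x7F:
                prefix = f"line {line_number}: "
                # Place the caret under the start of the escaped character.
                caret = " " * (len(prefix) + len(escape(line[:index]))) + "^"
                return [
                    "Non-ASCII character found in commit message:",
                    prefix + escape(line),
                    caret,
                    f'If this was not a mistake, add "-o {BYPASS_OPTION}" '
                    "to bypass this check.",
                ]
    return []


# A made-up message with U+00A0 after "Reviewed-by:".
sample = "Fix foo\n\nReviewed-by:\u00a0A Reviewer <r@example.org>"
for diagnostic in check_message(sample, push_options({})):
    print(diagnostic)
```

A committer who really meant to use the character would then retry with `git push -o bypass-commit-encoding-check`, and the hook would see that option and skip the check.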


