Unicode security

public inbox for binutils@sourceware.org

* Unicode security
@ 2022-01-10 13:31 Reini Urban
  2022-01-10 21:07 ` Joseph Myers
  2022-01-21 15:55 ` Nick Clifton
  0 siblings, 2 replies; 9+ messages in thread
From: Reini Urban @ 2022-01-10 13:31 UTC
  To: binutils

Hi

Just a heads-up on the Unicode front:
Now that gcc has joined the long list of supporters of insecure Unicode
identifiers, identifiers are no longer identifiable once attackers abuse
UTF-8 homoglyphs, spoofing or even bidi, so the chance of real-world
attacks is higher. So far only D, clang (since 3.3) and exotic languages
(like nim, crystal) supported binary chunks as names, as the typical Linux
filesystem does.

There's no problem with ld and bfd per se. bfd treats its names as
symbols, not identifiers. Symbols are permitted to be unreadable and
unidentifiable binary chunks.
The problem is object files being used as ABI and, inherently, as API (via
headers, FFIs, linker scripts and .def files).

I outlined it here
https://github.com/rurban/libu8ident/blob/master/c23%2B%2Bproposal.md#12-issues-with-binutils-linkers-exported-identifiers
in my C23++ (and C23) proposal to follow the Unicode security guidelines
for identifiers (TR39). This is not yet finished; I am still working on
getting some stats and a better TR31 charset subset for XIDs (identifiers).

See e.g. this C file:

#include <assert.h>
int  الناس = 0;

int الإء() {
    return  الناس;
}
int main() {
  int ير = 1;
  assert(ير == 1 );
  return الإء();
}

which can now be compiled with gcc-10, leading to different interpretations
in the C preprocessor:
gcc cpp =>
# 2 "texts/arabic-1.c"
int \U00000627\U00000644\U00000646\U00000627\U00000633 = 0;
int \U00000627\U00000644\U00000625\U00000621() {
    return \U00000627\U00000644\U00000646\U00000627\U00000633;
}

i.e. the source is interpreted as UTF-8 and converted to extended
identifiers with \U code points.
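To make that conversion concrete, here is a minimal sketch in C of turning
UTF-8 bytes into \UXXXXXXXX escapes. The helper name utf8_to_ucn is
hypothetical and this is not gcc's code; it assumes well-formed UTF-8 and
passes ASCII through unchanged.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not gcc's code: rewrite every non-ASCII UTF-8
   sequence in `in` as a \UXXXXXXXX escape.  Assumes well-formed UTF-8.
   Returns 0 on success, -1 if `out` is too small or a sequence is cut. */
static int utf8_to_ucn(const char *in, char *out, size_t outsz)
{
    const unsigned char *p = (const unsigned char *)in;
    size_t used = 0;

    assert(in != NULL && out != NULL);
    while (*p) {
        unsigned long cp;
        int extra;                      /* number of continuation bytes */

        if (*p < 0x80)      { cp = *p;        extra = 0; }
        else if (*p < 0xE0) { cp = *p & 0x1F; extra = 1; }
        else if (*p < 0xF0) { cp = *p & 0x0F; extra = 2; }
        else                { cp = *p & 0x07; extra = 3; }
        p++;
        while (extra-- > 0) {
            if ((*p & 0xC0) != 0x80)
                return -1;              /* truncated or malformed sequence */
            cp = (cp << 6) | (*p++ & 0x3F);
        }
        if (cp < 0x80) {
            if (used + 2 > outsz)       /* one char + final NUL */
                return -1;
            out[used++] = (char)cp;
        } else {
            if (used + 11 > outsz)      /* "\U" + 8 hex digits + NUL */
                return -1;
            used += (size_t)sprintf(out + used, "\\U%08lX", cp);
        }
    }
    if (used >= outsz)
        return -1;
    out[used] = '\0';
    return 0;
}
```

Fed the UTF-8 bytes of the function name from the C file above, this
yields the same four code points that appear in the gcc cpp output.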

in llvm/clang cpp:
# 2 "texts/arabic-1.c" 2
int الناس = 0;

int الإء() {
    return الناس;
}
i.e. the UTF-8 is kept as is, and its -emit-llvm produces
@"\D8\A7\D9\84\D9\86\D8\A7\D8\B3" = dso_local global i32 0, align 4
@.str = private unnamed_addr constant [10 x i8] c"\D9\8A\D8\B1 == 1\00", align 1
@.str.1 = private unnamed_addr constant [17 x i8] c"texts/arabic-1.c\00", align 1
@__PRETTY_FUNCTION__.main = private unnamed_addr constant [11 x i8] c"int main()\00", align 1

; Function Attrs: noinline nounwind optnone uwtable
define dso_local i32 @"\D8\A7\D9\84\D8\A5\D8\A1"() #0 {
  %1 = load i32, i32* @"\D8\A7\D9\84\D9\86\D8\A7\D8\B3", align 4
  ret i32 %1
}
...

keeping the UTF-8 bytes.

now to binutils:
 nm arabic-1.o
                 U __assert_fail
0000000000000010 T main
0000000000000000 T الإء
0000000000000000 B الناس

Of course, as the UTF-8 chars are kept as is, the exported functions can
include homoglyphs, and if so nm will display all variants as is; without
Unicode tools you'll have no idea which is which.
But what if the object file was compiled with some compiler in SHIFT-JIS
or KOI8, or even worse, in UTF-8 with Cyrillic homoglyph letters? An FFI
or linker will have a hard time linking to that.

So sooner or later some ELF/COFF/bla header field will be needed to state
the obvious:
name is UTF-8.
And sooner or later binutils will need to restrict its symbols to be
identifiable, as will Linux filesystems.
Therefore I provide the utilities for Unicode identifier security here:
https://github.com/rurban/libu8ident
It's mostly a restriction of the id_start and id_cont characters (to some
recommended scripts), plus checks for illegal combining marks, illegal
mixed scripts and normalization issues.
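As a rough illustration of the mixed-scripts idea, here is a deliberately
simplified sketch in C. It is not libu8ident's API: the names next_cp and
mixes_latin_cyrillic are hypothetical, and it classifies by block ranges
only (ASCII letters versus U+0400..U+04FF), whereas a real TR39 check
relies on the Unicode Script property data.

```c
#include <assert.h>
#include <stddef.h>

/* Decode one code point from well-formed UTF-8 at *pp and advance *pp.
   No validation is done here; this is a sketch, not a robust decoder. */
static unsigned long next_cp(const unsigned char **pp)
{
    const unsigned char *p = *pp;
    unsigned long cp;
    int extra;

    if (*p < 0x80)      { cp = *p;        extra = 0; }
    else if (*p < 0xE0) { cp = *p & 0x1F; extra = 1; }
    else if (*p < 0xF0) { cp = *p & 0x0F; extra = 2; }
    else                { cp = *p & 0x07; extra = 3; }
    p++;
    while (extra-- > 0 && (*p & 0xC0) == 0x80)
        cp = (cp << 6) | (*p++ & 0x3F);
    *pp = p;
    return cp;
}

/* Hypothetical check: returns 1 if `name` contains both ASCII Latin
   letters and code points from the Cyrillic block (U+0400..U+04FF). */
static int mixes_latin_cyrillic(const char *name)
{
    const unsigned char *p = (const unsigned char *)name;
    int latin = 0, cyrillic = 0;

    assert(name != NULL);
    while (*p) {
        unsigned long cp = next_cp(&p);
        if ((cp >= 'A' && cp <= 'Z') || (cp >= 'a' && cp <= 'z'))
            latin = 1;
        else if (cp >= 0x0400 && cp <= 0x04FF)
            cyrillic = 1;
    }
    return latin && cyrillic;
}
```

A name such as "paypal" with its first "a" replaced by Cyrillic U+0430
is flagged, while an all-Latin or all-Cyrillic name is not.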

bfd needs to find names, and it could look up names normalized (e.g. NFC).
The C23++ standard has a proposal to demand NFC only, so most combining
marks will become illegal; only NFC names are allowed.
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2021/p1949r7.html (all
in favor)
What's standardized for C23++/C23 will be good enough for ld also, I
suppose, just that there's no -std=c23 flag or such.
And not even grep can search normalized strings yet. Well, someone has to
start, and it will be C++. In all fairness, first was Java, then my cperl,
then Rust, which did Unicode support properly.

E.g. for binutils an olint will be needed, linting object files for
unidentifiable names in objects and libraries.
With bfd/ld/objdump it could also start as a warning, e.g. like the recent
gcc bidi warning.

I currently have the following errors:
ENCODING, XID, SCRIPT, SCRIPTS, COMBINE, and optionally CONFUS.
ENCODING checks for illegal UTF-8 encodings.
XID checks for violations of the TR31 character sets for identifiers.
Allowed IdentifierStatus (TR39) is a good set, but for C23 there will be a
different set.
SCRIPT checks for disallowed, uncommon scripts (languages) defined in TR39.
SCRIPTS checks for violations of a TR39 mixed-scripts profile, where the
recommended profile is Moderately Restrictive or a C23 variant, C23_4,
which allows Greek (math) letters together with Latin.
COMBINE checks for illegal combining mark sequences where the mark does
not fit the base char (TR39).

CONFUS is just bikeshedding for corporate language lawyers, but the rest
are real security problems.
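
To give an idea of what an ENCODING-style check involves, here is a
minimal sketch in C. The function name utf8_is_valid is hypothetical and
this is not the libu8ident code; it rejects stray or missing continuation
bytes, overlong forms, surrogates and values above U+10FFFF.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical ENCODING-style check: returns 1 if the NUL-terminated
   string is well-formed UTF-8, 0 otherwise. */
static int utf8_is_valid(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;

    assert(s != NULL);
    while (*p) {
        unsigned long cp, min;
        int extra;

        if (*p < 0x80)                { cp = *p;        extra = 0; min = 0; }
        else if ((*p & 0xE0) == 0xC0) { cp = *p & 0x1F; extra = 1; min = 0x80; }
        else if ((*p & 0xF0) == 0xE0) { cp = *p & 0x0F; extra = 2; min = 0x800; }
        else if ((*p & 0xF8) == 0xF0) { cp = *p & 0x07; extra = 3; min = 0x10000; }
        else return 0;                /* stray continuation or invalid lead */
        p++;
        while (extra-- > 0) {
            if ((*p & 0xC0) != 0x80)
                return 0;             /* missing continuation byte */
            cp = (cp << 6) | (*p++ & 0x3F);
        }
        if (cp < min)                     return 0; /* overlong form */
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0; /* surrogate */
        if (cp > 0x10FFFF)                return 0; /* out of range */
    }
    return 1;
}
```

A lone Latin-1 byte such as 0xE9, as a KOI8 or ISO-8859 source would
produce, fails this check because the expected continuation bytes are
missing.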
-- 
Reini Urban


* Re: Unicode security
  2022-01-10 13:31 Unicode security Reini Urban
@ 2022-01-10 21:07 ` Joseph Myers
  2022-01-10 21:38   ` Paul Koning
  2022-01-21 15:55 ` Nick Clifton
  1 sibling, 1 reply; 9+ messages in thread
From: Joseph Myers @ 2022-01-10 21:07 UTC
  To: Reini Urban; +Cc: binutils

On Mon, 10 Jan 2022, Reini Urban via Binutils wrote:

> So sooner or later some ELF/COFF/bla header field will be needed to state
> the obvious:
> name is UTF-8.

I think that's a matter for the ELF gABI document, where it describes the 
*C binding to ELF* (where it says "External C symbols have the same names 
in C and object files' symbol tables." - which says nothing about 
encoding, since the point of that statement in its historical context was 
probably to imply "no leading underscores added").  That is, ELF symbols 
are arbitrary 0-terminated octet sequences (ELF is not limited to C 
objects; symbols only need to be interpreted when included in diagnostics; 
and the assembler and linker should allow you to work with objects with 
arbitrary 0-terminated octet sequences for symbols if you want to, just as 
you can use ASCII characters in ELF symbols that aren't valid in C 
identifiers), but when ELF is used for objects compiled from C, those 
octet sequences for C identifiers with external linkage need to be 
interpreted in a particular way.

Maybe such a change could be proposed on the generic-abi list once Cary's 
public repository is available.

I attempted to get such a statement about UTF-8 encoding of ELF symbols 
for C identifiers with external linkage into the gABI in January 2005, and 
was directed to the ia64-abi@unix-os.sc.intel.com mailing list for that 
purpose.  At that time, the gABI maintainers on that mailing list weren't 
willing to accept such a change, but maybe the current maintainers on the 
current mailing list would be.  If it's still not acceptable for the gABI, 
then an operating-system-specific ABI would be the place to go.

DWARF (version 3 and later) does have DW_AT_use_UTF8.

-- 
Joseph S. Myers
joseph@codesourcery.com


* Re: Unicode security
  2022-01-10 21:07 ` Joseph Myers
@ 2022-01-10 21:38   ` Paul Koning
  2022-01-10 22:13     ` Joseph Myers
  0 siblings, 1 reply; 9+ messages in thread
From: Paul Koning @ 2022-01-10 21:38 UTC
  To: Joseph Myers; +Cc: Reini Urban, binutils

> On Jan 10, 2022, at 4:07 PM, Joseph Myers <joseph@codesourcery.com> wrote:
> 
> On Mon, 10 Jan 2022, Reini Urban via Binutils wrote:
> 
>> So sooner or later some ELF/COFF/bla header field will be needed to state
>> the obvious:
>> name is UTF-8.
> 
> I think that's a matter for the ELF gABI document, where it describes the 
> *C binding to ELF* (where it says "External C symbols have the same names 
> in C and object files' symbol tables." - which says nothing about 
> encoding, since the point of that statement in its historical context was 
> probably to imply "no leading underscores added"). 

One complication is that "the same name" has a trivially obvious meaning with ASCII (identical byte strings) but not with Unicode, where a given string can be encoded several ways.  This is why "normalization" exists in Unicode, as a way to convert valid Unicode strings into a single representation so they can be easily compared as byte strings.  But to make matters somewhat complicated, there are several normalizations.  A standard that needs to handle Unicode and have a definition of "equal strings" will want to refer to a particular normalization.  For example, the IETF iSCSI standard does this; in that particular case the normalization used is one that folds case, but there are also non-folding normalizations.

As an example of a string where this matters, "é" can be encoded either as the "precomposed" character "lowercase e with acute accent" or "lowercase e" followed by "combining acute accent".  Both represent the same string.  

	paul


* Re: Unicode security
  2022-01-10 21:38   ` Paul Koning
@ 2022-01-10 22:13     ` Joseph Myers
  2022-01-11  0:41       ` Paul Koning
  2022-01-21 22:22       ` Mike Frysinger
  0 siblings, 2 replies; 9+ messages in thread
From: Joseph Myers @ 2022-01-10 22:13 UTC
  To: Paul Koning; +Cc: binutils

On Mon, 10 Jan 2022, Paul Koning via Binutils wrote:

> A standard that needs to handle Unicode and have a definition of "equal 
> strings" will want to refer to a particular normalization.

For the purposes of ELF, equal strings are equal octet sequences, with no 
further interpretation.

The ELF bindings to C do not need a concept of "equal", they just need to 
say that UTF-8 is used to encode the sequence of Unicode code points in 
the C symbol.  Those bindings need to handle multiple C versions with 
different sets of allowed characters in identifiers, some of which allow 
identifiers that are different as sequences of Unicode code points, and 
thus different in C and in UTF-8, although the same in NFC.  In those 
cases, the bindings need to result in different octet sequences in ELF 
symbols for those different (but normalized the same) C identifiers.  When 
a C identifier is written in NFC, so must the ELF symbol be; when a C 
identifier is written in NFD, so must the ELF symbol be; when a C 
identifier is in neither normalization form, so must the ELF symbol be.
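
As a rough sketch of what "equal octet sequences" means in practice, the
following C fragment looks names up in a toy symbol table by exact byte
comparison. The struct sym type, the lookup function and the table are
hypothetical illustrations, not BFD's data structures.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy symbol table entry; hypothetical, unrelated to BFD's real types. */
struct sym {
    const char *name;     /* arbitrary 0-terminated octet sequence */
    unsigned long value;
};

/* Look up `name` by exact octet comparison, with no normalization. */
static const struct sym *lookup(const struct sym *tab, size_t n,
                                const char *name)
{
    size_t i;

    assert(tab != NULL && name != NULL);
    for (i = 0; i < n; i++)
        if (strcmp(tab[i].name, name) == 0)
            return &tab[i];
    return NULL;
}

/* Two C identifiers that look identical but differ as code point
   sequences become two distinct symbols. */
static const struct sym table[] = {
    { "caf\xC3\xA9",  0x10 },   /* "café" with U+00E9 (NFC) */
    { "cafe\xCC\x81", 0x20 },   /* "café" as e + U+0301 (NFD) */
};
```

Each spelling only ever matches itself, so a reference written in one
form does not resolve to a definition written in the other.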

-- 
Joseph S. Myers
joseph@codesourcery.com


* Re: Unicode security
  2022-01-10 22:13     ` Joseph Myers
@ 2022-01-11  0:41       ` Paul Koning
  2022-01-11  0:50         ` Joseph Myers
  2022-01-21 22:22       ` Mike Frysinger
  1 sibling, 1 reply; 9+ messages in thread
From: Paul Koning @ 2022-01-11  0:41 UTC
  To: Joseph Myers; +Cc: binutils



> On Jan 10, 2022, at 5:13 PM, Joseph Myers <joseph@codesourcery.com> wrote:
> 
> On Mon, 10 Jan 2022, Paul Koning via Binutils wrote:
> 
>> A standard that needs to handle Unicode and have a definition of "equal 
>> strings" will want to refer to a particular normalization.
> 
> For the purposes of ELF, equal strings are equal octet sequences, with no 
> further interpretation.
> 
> The ELF bindings to C do not need a concept of "equal", they just need to 
> say that UTF-8 is used to encode the sequence of Unicode code points in 
> the C symbol.  Those bindings need to handle multiple C versions with 
> different sets of allowed characters in identifiers, some of which allow 
> identifiers that are different as sequences of Unicode code points, and 
> thus different in C and in UTF-8, although the same in NFC.  In those 
> cases, the bindings need to result in different octet sequences in ELF 
> symbols for those different (but normalized the same) C identifiers.  When 
> a C identifier is written in NFC, so must the ELF symbol be; when a C 
> identifier is written in NFD, so must the ELF symbol be; when a C 
> identifier is in neither normalization form, so must the ELF symbol be.

Yikes.  So if you use a different text editor than the previous author, or a different compiler, your C code with Unicode identifiers might suddenly get link errors because the same string got encoded a different way.

This is clearly bad; is there some way for this to be fixed?

	paul



* Re: Unicode security
  2022-01-11  0:41       ` Paul Koning
@ 2022-01-11  0:50         ` Joseph Myers
  2022-01-21 16:52           ` Thomas Wolff
  0 siblings, 1 reply; 9+ messages in thread
From: Joseph Myers @ 2022-01-11  0:50 UTC
  To: Paul Koning; +Cc: binutils

On Mon, 10 Jan 2022, Paul Koning via Binutils wrote:

> Yikes.  So if you use a different text editor than the previous author, 
> or a different compiler, your C code with Unicode identifiers might 
> suddenly get link errors because the same string got encoded a different 
> way.
> 
> This is clearly bad; is there some way for this to be fixed?

Users should pay attention to compiler warnings (GCC enables 
-Wnormalized=nfc by default).

-- 
Joseph S. Myers
joseph@codesourcery.com


* Re: Unicode security
  2022-01-10 13:31 Unicode security Reini Urban
  2022-01-10 21:07 ` Joseph Myers
@ 2022-01-21 15:55 ` Nick Clifton
  1 sibling, 0 replies; 9+ messages in thread
From: Nick Clifton @ 2022-01-21 15:55 UTC
  To: Reini Urban, binutils

Hi Reini,

> now to binutils:
>   nm arabic-1.o
>                   U __assert_fail
> 0000000000000010 T main
> 0000000000000000 T الإء
> 0000000000000000 B
> 
> of course, as utf-8 chars are kept asis. the exported functions can include
> homoglyphs and if so will display all variants asis, and without unicode
> tools you'll have no idea which is what.

Just to note that the latest binutils sources include an option to display
Unicode characters as escape sequences or hex bytes, so that they can be
more easily detected:

   % nm -Ux arabic-1.o
                    U __assert_fail
   000000000000000c T main
   0000000000000000 T <0xd8a7><0xd984><0xd8a5><0xd8a1>
   0000000000000000 B <0xd8a7><0xd984><0xd986><0xd8a7><0xd8b3>

   % nm -Ue arabic-1.o
                    U __assert_fail
   000000000000000c T main
   0000000000000000 T \u0627\u0644\u0625\u0621
   0000000000000000 B \u0627\u0644\u0646\u0627\u0633

There is even an option to highlight Unicode characters so that they stand
out even more, but this only works when the tool is run in a terminal that
supports colour escape sequences.
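
For readers curious how the hex display shown above could be produced,
here is a minimal sketch in C. It is not the binutils implementation: the
function name hexify_non_ascii is hypothetical, and it assumes well-formed
UTF-8 input.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical sketch, not the binutils code: copy `in` to `out`,
   rendering each multi-byte UTF-8 sequence as <0x..> with its raw bytes
   in hex.  Returns 0 on success, -1 if `out` is too small. */
static int hexify_non_ascii(const char *in, char *out, size_t outsz)
{
    const unsigned char *p = (const unsigned char *)in;
    size_t used = 0;
    int k;

    assert(in != NULL && out != NULL);
    while (*p) {
        if (*p < 0x80) {
            if (used + 2 > outsz)       /* one char + final NUL */
                return -1;
            out[used++] = (char)*p++;
            continue;
        }
        /* "<0x" + up to 4 bytes as 2 hex digits each + ">" + NUL = 13 */
        if (used + 13 > outsz)
            return -1;
        used += (size_t)sprintf(out + used, "<0x%02x", (unsigned)*p++);
        for (k = 0; k < 3 && (*p & 0xC0) == 0x80; k++)
            used += (size_t)sprintf(out + used, "%02x", (unsigned)*p++);
        out[used++] = '>';
    }
    if (used >= outsz)
        return -1;
    out[used] = '\0';
    return 0;
}
```

Applied to the UTF-8 bytes of the function name in arabic-1.o, this gives
the same <0xd8a7><0xd984><0xd8a5><0xd8a1> string as the output above.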

Cheers
   Nick



* Re: Unicode security
  2022-01-11  0:50         ` Joseph Myers
@ 2022-01-21 16:52           ` Thomas Wolff
  0 siblings, 0 replies; 9+ messages in thread
From: Thomas Wolff @ 2022-01-21 16:52 UTC
  To: binutils


Am 11.01.2022 um 01:50 schrieb Joseph Myers:
> On Mon, 10 Jan 2022, Paul Koning via Binutils wrote:
>
>> Yikes.  So if you use a different text editor than the previous author,
>> or a different compiler, your C code with Unicode identifiers might
>> suddenly get link errors because the same string got encoded a different way.
No, neither editors nor compilers should ever change identifiers (e.g. 
by normalization) under the hood; I'd consider that buggy. So 
identifiers will stay as they are and can be matched when linking.
Only if people edit different files with different editors could it 
happen that visually equal strings are encoded differently, but that's 
another issue and can be fixed.
>>
>> This is clearly bad; is there some way for this to be fixed?
> Users should pay attention to compiler warnings (GCC enables
> -Wnormalized=nfc by default).
Which are however just warnings, not transformations.


* Re: Unicode security
  2022-01-10 22:13     ` Joseph Myers
  2022-01-11  0:41       ` Paul Koning
@ 2022-01-21 22:22       ` Mike Frysinger
  1 sibling, 0 replies; 9+ messages in thread
From: Mike Frysinger @ 2022-01-21 22:22 UTC
  To: binutils


On 10 Jan 2022 22:13, Joseph Myers wrote:
> On Mon, 10 Jan 2022, Paul Koning via Binutils wrote:
> > A standard that needs to handle Unicode and have a definition of "equal 
> > strings" will want to refer to a particular normalization.
> 
> For the purposes of ELF, equal strings are equal octet sequences, with no 
> further interpretation.
> 
> The ELF bindings to C do not need a concept of "equal", they just need to 
> say that UTF-8 is used to encode the sequence of Unicode code points in 
> the C symbol.  Those bindings need to handle multiple C versions with 
> different sets of allowed characters in identifiers, some of which allow 
> identifiers that are different as sequences of Unicode code points, and 
> thus different in C and in UTF-8, although the same in NFC.  In those 
> cases, the bindings need to result in different octet sequences in ELF 
> symbols for those different (but normalized the same) C identifiers.  When 
> a C identifier is written in NFC, so must the ELF symbol be; when a C 
> identifier is written in NFD, so must the ELF symbol be; when a C 
> identifier is in neither normalization form, so must the ELF symbol be.

this really is the only reasonable & maintainable position for the toolchain
projects to take.  higher level concerns about NFC are best left to higher
level diagnostics (like gcc's -W flag you highlighted already).
-mike



