public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed
* [Bug analyzer/109098] New: Encoding errors on SARIF output for non-UTF-8 source files
@ 2023-03-11  0:22 dmalcolm at gcc dot gnu.org
  2023-03-11  0:24 ` [Bug analyzer/109098] " dmalcolm at gcc dot gnu.org
                   ` (9 more replies)
  0 siblings, 10 replies; 11+ messages in thread
From: dmalcolm at gcc dot gnu.org @ 2023-03-11  0:22 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109098

            Bug ID: 109098
           Summary: Encoding errors on SARIF output for non-UTF-8 source
                    files
           Product: gcc
           Version: 13.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: analyzer
          Assignee: dmalcolm at gcc dot gnu.org
          Reporter: dmalcolm at gcc dot gnu.org
  Target Milestone: ---

I've run into issues with -fanalyzer integration testing when a diagnostic
occurs in a source file with non-UTF-8 bytes in it.

Specifically, haproxy-2.7.1's include/import/ist.h has a comment within
function "istalloc" that isn't quite UTF-8.

When a diagnostic occurs on this, with  -fdiagnostics-format=sarif-file, the
file is quoted in the SARIF output, but the individual bytes are copied
directly into the JSON, leading to e.g.:

$ sarif ls integration-tests/haproxy-2.7.1/haproxy-2.7.1/http_client.c.sarif
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/sarif/loader.py", line 57, in
load_sarif_file
    data = json.load(file_in)
  File "/usr/lib64/python3.8/json/__init__.py", line 293, in load
    return loads(fp.read(),
  File "/usr/lib64/python3.8/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 96006:
invalid start byte

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/bin/sarif", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/site-packages/sarif/cmdline/main.py", line 40,
in main
    exitcode = args.func(args)
  File "/usr/local/lib/python3.8/site-packages/sarif/cmdline/main.py", line
384, in _ls
    ls_op.print_ls(args.files_or_dirs, args.output)
  File "/usr/local/lib/python3.8/site-packages/sarif/operations/ls_op.py", line
17, in print_ls
    sarif_files = loader.load_sarif_files(path)
  File "/usr/local/lib/python3.8/site-packages/sarif/loader.py", line 30, in
load_sarif_files
    path_exists = _add_path_to_sarif_file_set(path, ret)
  File "/usr/local/lib/python3.8/site-packages/sarif/loader.py", line 17, in
_add_path_to_sarif_file_set
    sarif_file_set.add_file(load_sarif_file(path))
  File "/usr/local/lib/python3.8/site-packages/sarif/loader.py", line 60, in
load_sarif_file
    raise IOError(f"Cannot load {file_path}") from exception
OSError: integration-tests/haproxy-2.7.1/haproxy-2.7.1/http_client.c.sarif

I'm working on a fix.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [Bug analyzer/109098] Encoding errors on SARIF output for non-UTF-8 source files
  2023-03-11  0:22 [Bug analyzer/109098] New: Encoding errors on SARIF output for non-UTF-8 source files dmalcolm at gcc dot gnu.org
@ 2023-03-11  0:24 ` dmalcolm at gcc dot gnu.org
  2023-03-11  0:26 ` pinskia at gcc dot gnu.org
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: dmalcolm at gcc dot gnu.org @ 2023-03-11  0:24 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109098

David Malcolm <dmalcolm at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|                            |2023-03-11
             Status|UNCONFIRMED                 |ASSIGNED
     Ever confirmed|0                           |1

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [Bug analyzer/109098] Encoding errors on SARIF output for non-UTF-8 source files
  2023-03-11  0:22 [Bug analyzer/109098] New: Encoding errors on SARIF output for non-UTF-8 source files dmalcolm at gcc dot gnu.org
  2023-03-11  0:24 ` [Bug analyzer/109098] " dmalcolm at gcc dot gnu.org
@ 2023-03-11  0:26 ` pinskia at gcc dot gnu.org
  2023-03-11  0:28 ` pinskia at gcc dot gnu.org
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: pinskia at gcc dot gnu.org @ 2023-03-11  0:26 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109098

--- Comment #1 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
I would have assumed you need -finput-charset= for the non-utf8 ones really if
your LANG/LANGUAGE is not set to C/UTF8 really.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [Bug analyzer/109098] Encoding errors on SARIF output for non-UTF-8 source files
  2023-03-11  0:22 [Bug analyzer/109098] New: Encoding errors on SARIF output for non-UTF-8 source files dmalcolm at gcc dot gnu.org
  2023-03-11  0:24 ` [Bug analyzer/109098] " dmalcolm at gcc dot gnu.org
  2023-03-11  0:26 ` pinskia at gcc dot gnu.org
@ 2023-03-11  0:28 ` pinskia at gcc dot gnu.org
  2023-03-11  0:31 ` dmalcolm at gcc dot gnu.org
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: pinskia at gcc dot gnu.org @ 2023-03-11  0:28 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109098

--- Comment #2 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
https://gcc.gnu.org/onlinedocs/gcc-12.2.0/gcc/Preprocessor-Options.html#index-finput-charset

Even has the following:

-finput-charset=charset
Set the input character set, used for translation from the character set of the
input file to the source character set used by GCC. If the locale does not
specify, or GCC cannot get this information from the locale, the default is
UTF-8. This can be overridden by either the locale or this command-line option.
Currently the command-line option takes precedence if there’s a conflict.
charset can be any encoding supported by the system’s iconv library routine.

So I think there is a bug in that code ...

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [Bug analyzer/109098] Encoding errors on SARIF output for non-UTF-8 source files
  2023-03-11  0:22 [Bug analyzer/109098] New: Encoding errors on SARIF output for non-UTF-8 source files dmalcolm at gcc dot gnu.org
                   ` (2 preceding siblings ...)
  2023-03-11  0:28 ` pinskia at gcc dot gnu.org
@ 2023-03-11  0:31 ` dmalcolm at gcc dot gnu.org
  2023-03-11  0:33 ` dmalcolm at gcc dot gnu.org
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: dmalcolm at gcc dot gnu.org @ 2023-03-11  0:31 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109098

--- Comment #3 from David Malcolm <dmalcolm at gcc dot gnu.org> ---
(In reply to Andrew Pinski from comment #1)
> I would have assumed you need -finput-charset= for the non-utf8 ones really
> if your LANG/LANGUAGE is not set to C/UTF8 really.

Yeah, but when complaining about encoding issues, the error message we emit
should at least be properly encoded :/

It's a major pain for my integration testing where two(?) bad bytes in one
source file lead to an unparseable .sarif file (out of thousands).

When quoting source in the .sarif output, we should ensure that the final JSON
output is all valid UTF-8, perhaps falling back to not quoting source for cases
where e.g.
- the source file isn't validly encoded, or
- the -finput-charset= is wrong, or   
- the -finput-charset= is missing or
- where the source file (erroneously) uses a mixture of different encodings in
different 
parts of itself

Probably should also check we do something sane for trojan source attacks

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [Bug analyzer/109098] Encoding errors on SARIF output for non-UTF-8 source files
  2023-03-11  0:22 [Bug analyzer/109098] New: Encoding errors on SARIF output for non-UTF-8 source files dmalcolm at gcc dot gnu.org
                   ` (3 preceding siblings ...)
  2023-03-11  0:31 ` dmalcolm at gcc dot gnu.org
@ 2023-03-11  0:33 ` dmalcolm at gcc dot gnu.org
  2023-03-11  0:55 ` hp at gcc dot gnu.org
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: dmalcolm at gcc dot gnu.org @ 2023-03-11  0:33 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109098

--- Comment #4 from David Malcolm <dmalcolm at gcc dot gnu.org> ---
(In reply to Andrew Pinski from comment #2)
> So I think there is a bug in that code ...

The issue is in sarif_builder::maybe_make_artifact_content_object, which uses;

 char *text_utf8 = maybe_read_file (filename);

where there's no guarantee that "text_utf8" is (ahem) actually utf-8.  Sorry
about that.

Working on a fix to make it use the input.cc source-quoting machinery, which
should handle encoding.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [Bug analyzer/109098] Encoding errors on SARIF output for non-UTF-8 source files
  2023-03-11  0:22 [Bug analyzer/109098] New: Encoding errors on SARIF output for non-UTF-8 source files dmalcolm at gcc dot gnu.org
                   ` (4 preceding siblings ...)
  2023-03-11  0:33 ` dmalcolm at gcc dot gnu.org
@ 2023-03-11  0:55 ` hp at gcc dot gnu.org
  2023-03-13 21:46 ` joseph at codesourcery dot com
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: hp at gcc dot gnu.org @ 2023-03-11  0:55 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109098

Hans-Peter Nilsson <hp at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |hp at gcc dot gnu.org

--- Comment #5 from Hans-Peter Nilsson <hp at gcc dot gnu.org> ---
While considering UTF-8 in SARIF files, please also have a look at PR105959.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [Bug analyzer/109098] Encoding errors on SARIF output for non-UTF-8 source files
  2023-03-11  0:22 [Bug analyzer/109098] New: Encoding errors on SARIF output for non-UTF-8 source files dmalcolm at gcc dot gnu.org
                   ` (5 preceding siblings ...)
  2023-03-11  0:55 ` hp at gcc dot gnu.org
@ 2023-03-13 21:46 ` joseph at codesourcery dot com
  2023-03-25  1:01 ` dmalcolm at gcc dot gnu.org
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: joseph at codesourcery dot com @ 2023-03-13 21:46 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109098

--- Comment #6 from joseph at codesourcery dot com <joseph at codesourcery dot com> ---
For diagnosis of non-UTF-8 in strings / comments, see commit 
0b8c57ed40f19086e30ce54faec3222ac21cc0df, "libcpp: Add -Winvalid-utf8 
warning [PR106655]" (implementing a new C++ requirement).

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [Bug analyzer/109098] Encoding errors on SARIF output for non-UTF-8 source files
  2023-03-11  0:22 [Bug analyzer/109098] New: Encoding errors on SARIF output for non-UTF-8 source files dmalcolm at gcc dot gnu.org
                   ` (6 preceding siblings ...)
  2023-03-13 21:46 ` joseph at codesourcery dot com
@ 2023-03-25  1:01 ` dmalcolm at gcc dot gnu.org
  2023-03-25  1:57 ` hp at gcc dot gnu.org
  2023-03-27 15:34 ` dmalcolm at gcc dot gnu.org
  9 siblings, 0 replies; 11+ messages in thread
From: dmalcolm at gcc dot gnu.org @ 2023-03-25  1:01 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109098

David Malcolm <dmalcolm at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|---                         |FIXED

--- Comment #7 from David Malcolm <dmalcolm at gcc dot gnu.org> ---
Should be fixed on trunk by r13-6861-gd495ea2b232f3e:

https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=d495ea2b232f3eb50155d7c7362c09a744766746

https://gcc.gnu.org/git/?p=gcc.git;a=commitdiff_plain;h=d495ea2b232f3eb50155d7c7362c09a744766746

The invalid UTF-8 in the patch seems to have broken the server-side script:

Enumerating objects: 51, done.
Counting objects: 100% (51/51), done.
Delta compression using up to 64 threads
Compressing objects: 100% (29/29), done.
Writing objects: 100% (29/29), 7.74 KiB | 1.29 MiB/s, done.
Total 29 (delta 22), reused 0 (delta 0), pack-reused 0
remote: Traceback (most recent call last):
remote:   File "hooks/post_receive.py", line 118, in <module>
remote:     post_receive(refs_data, args.submitter_email)
remote:   File "hooks/post_receive.py", line 65, in post_receive
remote:     submitter_email)
remote:   File "hooks/post_receive.py", line 47, in post_receive_one
remote:     update.send_email_notifications()
remote:   File
"/sourceware1/projects/src-home/git-hooks/hooks/updates/__init__.py", line 189,
in send_email_notifications
remote:     self.__email_new_commits()
remote:   File
"/sourceware1/projects/src-home/git-hooks/hooks/updates/__init__.py", line
1031, in __email_new_commits
remote:     commit, self.get_standard_commit_email(commit))
remote:   File
"/sourceware1/projects/src-home/git-hooks/hooks/updates/__init__.py", line
1011, in __send_commit_email
remote:     default_diff=email.diff)
remote:   File
"/sourceware1/projects/src-home/git-hooks/hooks/updates/__init__.py", line 946,
in __maybe_get_email_custom_contents
remote:     hook_input=json.dumps(hooks_data),
remote:   File "/usr/lib64/python2.7/json/__init__.py", line 244, in dumps
remote:     return _default_encoder.encode(obj)
remote:   File "/usr/lib64/python2.7/json/encoder.py", line 207, in encode
remote:     chunks = self.iterencode(o, _one_shot=True)
remote:   File "/usr/lib64/python2.7/json/encoder.py", line 270, in iterencode
remote:     return _iterencode(o, 0)
remote: UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position
13147: invalid start byte
To git+ssh://gcc.gnu.org/git/gcc.git
   13ec81eb4c3..d495ea2b232  master -> master

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [Bug analyzer/109098] Encoding errors on SARIF output for non-UTF-8 source files
  2023-03-11  0:22 [Bug analyzer/109098] New: Encoding errors on SARIF output for non-UTF-8 source files dmalcolm at gcc dot gnu.org
                   ` (7 preceding siblings ...)
  2023-03-25  1:01 ` dmalcolm at gcc dot gnu.org
@ 2023-03-25  1:57 ` hp at gcc dot gnu.org
  2023-03-27 15:34 ` dmalcolm at gcc dot gnu.org
  9 siblings, 0 replies; 11+ messages in thread
From: hp at gcc dot gnu.org @ 2023-03-25  1:57 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109098

--- Comment #8 from Hans-Peter Nilsson <hp at gcc dot gnu.org> ---
(In reply to David Malcolm from comment #7)
> The invalid UTF-8 in the patch seems to have broken the server-side script:

Maybe the not-really-utf8 files need to be marked in some way in the git repo
to be safely handled for future checkout and updates, including the problematic
scripting?  However, reading gitattributes(5) it's not obvious how.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [Bug analyzer/109098] Encoding errors on SARIF output for non-UTF-8 source files
  2023-03-11  0:22 [Bug analyzer/109098] New: Encoding errors on SARIF output for non-UTF-8 source files dmalcolm at gcc dot gnu.org
                   ` (8 preceding siblings ...)
  2023-03-25  1:57 ` hp at gcc dot gnu.org
@ 2023-03-27 15:34 ` dmalcolm at gcc dot gnu.org
  9 siblings, 0 replies; 11+ messages in thread
From: dmalcolm at gcc dot gnu.org @ 2023-03-27 15:34 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109098

--- Comment #9 from David Malcolm <dmalcolm at gcc dot gnu.org> ---
(In reply to Hans-Peter Nilsson from comment #8)
> (In reply to David Malcolm from comment #7)
> > The invalid UTF-8 in the patch seems to have broken the server-side script:
> 
> Maybe the not-really-utf8 files need to be marked in some way in the git
> repo to be safely handled for future checkout and updates, including the
> problematic scripting?  However, reading gitattributes(5) it's not obvious
> how.

Perhaps
  https://www.git-scm.com/docs/gitattributes#_marking_files_as_binary 
though it's not clear if we can do that for individual files (or if it's worth
bothering)

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2023-03-27 15:35 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-03-11  0:22 [Bug analyzer/109098] New: Encoding errors on SARIF output for non-UTF-8 source files dmalcolm at gcc dot gnu.org
2023-03-11  0:24 ` [Bug analyzer/109098] " dmalcolm at gcc dot gnu.org
2023-03-11  0:26 ` pinskia at gcc dot gnu.org
2023-03-11  0:28 ` pinskia at gcc dot gnu.org
2023-03-11  0:31 ` dmalcolm at gcc dot gnu.org
2023-03-11  0:33 ` dmalcolm at gcc dot gnu.org
2023-03-11  0:55 ` hp at gcc dot gnu.org
2023-03-13 21:46 ` joseph at codesourcery dot com
2023-03-25  1:01 ` dmalcolm at gcc dot gnu.org
2023-03-25  1:57 ` hp at gcc dot gnu.org
2023-03-27 15:34 ` dmalcolm at gcc dot gnu.org

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).