From: "dmalcolm at gcc dot gnu.org"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug analyzer/109098] Encoding errors on SARIF output for non-UTF-8 source files
Date: Sat, 11 Mar 2023 00:31:26 +0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109098

--- Comment #3 from David Malcolm ---
(In reply to Andrew Pinski from comment #1)
> I would have assumed you need -finput-charset= for the non-utf8 ones really
> if your LANG/LANGUAGE is not set to C/UTF8 really.

Yeah, but when complaining about encoding issues, the error message we emit
should at least be properly encoded :/

It's a major pain for my integration testing where two(?) bad bytes in one
source file lead to an unparseable .sarif file (out of thousands).

When quoting source in the .sarif output, we should ensure that the final JSON
output is all valid UTF-8, perhaps falling back to not quoting source for cases
where e.g.
- the source file isn't validly encoded, or
- the -finput-charset= is wrong, or
- the -finput-charset= is missing, or
- the source file (erroneously) uses a mixture of different encodings in
  different parts of itself

We should probably also check that we do something sane for trojan source
attacks.
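
To illustrate the fallback described above, here is a minimal Python sketch
(not GCC's actual implementation, which is in C++). It decodes a source line
strictly and only attaches a SARIF "snippet" when decoding succeeds. The
helper names decode_source_line, make_region and to_json_bytes are
hypothetical, and the input_charset parameter merely stands in for whatever
-finput-charset= was given.

```python
import codecs
import json


def decode_source_line(raw, input_charset="utf-8"):
    """Return the line as str, or None if it cannot be decoded cleanly.

    Hypothetical helper: input_charset stands in for -finput-charset=.
    """
    try:
        codecs.lookup(input_charset)
    except LookupError:
        return None  # unknown charset name: do not quote the source
    try:
        return raw.decode(input_charset, errors="strict")
    except UnicodeDecodeError:
        return None  # invalid or mixed encoding: do not quote the source


def make_region(start_line, raw, input_charset="utf-8"):
    """Build a SARIF-style region, omitting the snippet on decode failure."""
    region = {"startLine": start_line}
    text = decode_source_line(raw, input_charset)
    if text is not None:
        region["snippet"] = {"text": text}
    return region


def to_json_bytes(obj):
    """Serialize to JSON and encode the result as UTF-8."""
    return json.dumps(obj, ensure_ascii=False).encode("utf-8")
```

Note that strict decoding cannot catch every wrong -finput-charset=: a
single-byte charset such as latin-1 accepts any byte sequence, so the snippet
may be mojibake, but the resulting JSON is still valid UTF-8. This sketch also
does nothing about trojan source (bidirectional control) characters; that
would need a separate check.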