From mboxrd@z Thu Jan 1 00:00:00 1970
Subject: Re: Patch RFA: Support non-ASCII file names in git-changelog
From: =?UTF-8?Q?Martin_Li=c5=a1ka?=
To: Joel Brobecker
Cc: Jakub Jelinek, Jonathan Wakely, gcc-patches, Ian Lance Taylor
References: <2b8fc5da-0a7e-2feb-9d22-6fecc349d842@suse.cz>
 <733ffec8-8809-d7fd-f0bf-9b1d9a55d7fc@suse.cz>
 <20201221094837.GG3788@tucnak>
 <64528957-cf87-676a-70cd-7fdd5bfeaf17@suse.cz>
 <20201224121638.GL353421@adacore.com>
 <8beaddc2-402d-b90c-6d53-2903f92275a2@suse.cz>
 <3c44a148-9514-cf34-0e76-cba9b08b5027@suse.cz>
Message-ID: <868ce542-e180-dfd4-1bdd-6730a1011494@suse.cz>
Date: Wed, 6 Jan 2021 08:25:52 +0100
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.6.0
MIME-Version: 1.0
In-Reply-To: <3c44a148-9514-cf34-0e76-cba9b08b5027@suse.cz>
Content-Type: multipart/mixed;
 boundary="------------C8C5E83C05067B9245C46BC7"
Content-Language: en-US
List-Id: Gcc-patches mailing list

This is a multi-part message in MIME format.
--------------C8C5E83C05067B9245C46BC7
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit

On 1/4/21 12:47 PM, Martin Liška wrote:
> On 1/4/21 12:01 PM, Martin Liška wrote:
>> Anyway, I'm going to update server hook first and I'll create an issue for GitPython.
>
> So I was not correct about this. Also the server hooks uses now GitPython
> to identify modified files.
>
> I've just created an issue for that:
> https://github.com/gitpython-developers/GitPython/issues/1099

This one has been fixed, and the fix is present in the newly released v3.1.12.
Anyway, I've got a workaround that I'm going to push.

Martin

>
> Martin

--------------C8C5E83C05067B9245C46BC7
Content-Type: text/x-patch; charset=UTF-8;
 name="0001-gcc-changelog-workaround-for-utf8-filenames.patch"
Content-Transfer-Encoding: 8bit
Content-Disposition: attachment;
 filename="0001-gcc-changelog-workaround-for-utf8-filenames.patch"

>From ed9ffe47d6964dc92c92cfddbb8aac555c28e085 Mon Sep 17 00:00:00 2001
From: Martin Liska
Date: Wed, 6 Jan 2021 08:11:57 +0100
Subject: [PATCH] gcc-changelog: workaround for utf8 filenames

contrib/ChangeLog:

	* gcc-changelog/git_commit.py: Add decode_path function.
	* gcc-changelog/git_email.py: Use it in order to solve
	utf8 encoding filename issues.
	* gcc-changelog/git_repository.py: Likewise.
	* gcc-changelog/test_email.py: Test it.
---
 contrib/gcc-changelog/git_commit.py     | 26 +++++++++++++++++--------
 contrib/gcc-changelog/git_email.py      |  6 +++---
 contrib/gcc-changelog/git_repository.py |  6 +++---
 contrib/gcc-changelog/test_email.py     |  3 ++-
 4 files changed, 26 insertions(+), 15 deletions(-)

diff --git a/contrib/gcc-changelog/git_commit.py b/contrib/gcc-changelog/git_commit.py
index d2e5dbe294a..ee1973371be 100755
--- a/contrib/gcc-changelog/git_commit.py
+++ b/contrib/gcc-changelog/git_commit.py
@@ -174,6 +174,24 @@ REVIEW_PREFIXES = ('reviewed-by: ', 'reviewed-on: ', 'signed-off-by: ',
 DATE_FORMAT = '%Y-%m-%d'
 
 
+def decode_path(path):
+    # When core.quotepath is true (default value), utf8 chars are encoded like:
+    # "b/ko\304\215ka.txt"
+    #
+    # The upstream bug is fixed:
+    # https://github.com/gitpython-developers/GitPython/issues/1099
+    #
+    # but we still need a workaround for older versions of the library.
+    # Please take a look at the explanation of the transformation:
+    # https://stackoverflow.com/questions/990169/how-do-convert-unicode-escape-sequences-to-unicode-characters-in-a-python-string
+
+    if path.startswith('"') and path.endswith('"'):
+        return (path.strip('"').encode('utf8').decode('unicode-escape')
+                .encode('latin-1').decode('utf8'))
+    else:
+        return path
+
+
 class Error:
     def __init__(self, message, line=None):
         self.message = message
@@ -303,14 +321,6 @@ class GitCommit:
                                      'separately from normal commits'))
             return
 
-        # check for an encoded utf-8 filename
-        hint = 'git config --global core.quotepath false'
-        for modified, _ in self.info.modified_files:
-            if modified.startswith('"') or modified.endswith('"'):
-                self.errors.append(Error('Quoted UTF8 filename, please set: '
-                                         f'"{hint}"', modified))
-                return
-
         all_are_ignored = (len(project_files) + len(ignored_files)
                            == len(self.info.modified_files))
         self.parse_lines(all_are_ignored)
diff --git a/contrib/gcc-changelog/git_email.py b/contrib/gcc-changelog/git_email.py
index 5b53ca4a6a9..00ad00458f4 100755
--- a/contrib/gcc-changelog/git_email.py
+++ b/contrib/gcc-changelog/git_email.py
@@ -22,7 +22,7 @@ from itertools import takewhile
 
 from dateutil.parser import parse
 
-from git_commit import GitCommit, GitInfo
+from git_commit import GitCommit, GitInfo, decode_path
 
 from unidiff import PatchSet, PatchedFile
 
@@ -52,8 +52,8 @@ class GitEmail(GitCommit):
         modified_files = []
         for f in diff:
             # Strip "a/" and "b/" prefixes
-            source = f.source_file[2:]
-            target = f.target_file[2:]
+            source = decode_path(f.source_file)[2:]
+            target = decode_path(f.target_file)[2:]
 
             if f.is_added_file:
                 t = 'A'
diff --git a/contrib/gcc-changelog/git_repository.py b/contrib/gcc-changelog/git_repository.py
index 8edcff91ad6..a0e293d756d 100755
--- a/contrib/gcc-changelog/git_repository.py
+++ b/contrib/gcc-changelog/git_repository.py
@@ -26,7 +26,7 @@ except ImportError:
     print(' Debian, Ubuntu: python3-git')
     exit(1)
 
-from git_commit import GitCommit, GitInfo
+from git_commit import GitCommit, GitInfo, decode_path
 
 
 def parse_git_revisions(repo_path, revisions, strict=True):
@@ -51,11 +51,11 @@ def parse_git_revisions(repo_path, revisions, strict=True):
                     # Consider that renamed files are two operations:
                     # the deletion of the original name
                     # and the addition of the new one.
-                    modified_files.append((file.a_path, 'D'))
+                    modified_files.append((decode_path(file.a_path), 'D'))
                     t = 'A'
                 else:
                     t = 'M'
-                modified_files.append((file.b_path, t))
+                modified_files.append((decode_path(file.b_path), t))
 
             date = datetime.utcfromtimestamp(c.committed_date)
             author = '%s <%s>' % (c.author.name, c.author.email)
diff --git a/contrib/gcc-changelog/test_email.py b/contrib/gcc-changelog/test_email.py
index 2053531452c..5db56caef9e 100755
--- a/contrib/gcc-changelog/test_email.py
+++ b/contrib/gcc-changelog/test_email.py
@@ -402,4 +402,5 @@ class TestGccChangelog(unittest.TestCase):
 
     def test_bad_unicode_chars_in_filename(self):
         email = self.from_patch_glob('0001-Add-horse2.patch')
-        assert email.errors[0].message.startswith('Quoted UTF8 filename')
+        assert not email.errors
+        assert email.changelog_entries[0].files == ['koníček.txt']
-- 
2.29.2


--------------C8C5E83C05067B9245C46BC7--
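
For anyone following the decode_path change, here is a minimal standalone
sketch of the same transformation, split into steps. The function name
decode_quoted_path is hypothetical and used only for illustration. The sample
path is the one quoted in the patch comment. The sketch assumes git's default
core.quotepath behaviour, where non-ASCII bytes show up as octal escapes
inside a double-quoted path.

```python
def decode_quoted_path(path):
    """Illustrative stand-in (hypothetical name) for the patch's decode_path."""
    if path.startswith('"') and path.endswith('"'):
        inner = path.strip('"')
        # 'unicode-escape' turns each \ooo octal escape into a single
        # code point in the range 0-255 (one per original byte).
        code_points = inner.encode('utf8').decode('unicode-escape')
        # 'latin-1' maps those code points back to the raw bytes, which
        # are then interpreted as the UTF-8 text git originally saw.
        return code_points.encode('latin-1').decode('utf8')
    # Unquoted paths carry no escapes and are returned unchanged.
    return path


quoted = r'"b/ko\304\215ka.txt"'   # as emitted with core.quotepath=true
print(decode_quoted_path(quoted))  # b/kočka.txt
print(decode_quoted_path('contrib/gcc-changelog/git_commit.py'))
```

Because unquoted paths pass through untouched, a helper like this can be
applied to every path coming from the diff, which is how the patch uses
decode_path in git_email.py and git_repository.py.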