From mboxrd@z Thu Jan 1 00:00:00 1970
Subject: Re: Commit messages and the move to git
To: Jason Merrill
Cc: Segher Boessenkool, Eric Raymond, Jeff Law, GCC Development
References: <20191107142727.GA72444@thyrsus.com> <20191109060151.GA82270@thyrsus.com>
 <78a57894-74f2-94d5-65a1-3ce2bd934f32@arm.com> <20191118155549.GH16031@gate.crashing.org>
 <20191118171115.GI16031@gate.crashing.org> <20191118185328.GJ16031@gate.crashing.org>
 <20191118194450.GL16031@gate.crashing.org>
From: "Richard Earnshaw (lists)"
Message-ID: <632176b4-0ff6-1389-1576-138700ea900b@arm.com>
Date: Mon, 02 Dec 2019 10:54:00 -0000
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="------------2B67D9E0B5C37962D655DDF4"
X-SW-Source: 2019-12/txt/msg00003.txt.bz2

This is a multi-part message in MIME format.
--------------2B67D9E0B5C37962D655DDF4
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Content-length: 6943

On 19/11/2019 14:56, Jason Merrill wrote:
> On Mon, Nov 18, 2019 at 4:38 PM Richard Earnshaw (lists) <
> Richard.Earnshaw@arm.com> wrote:
>
>> On 18/11/2019 20:53, Jason Merrill wrote:
>>> On Mon, Nov 18, 2019 at 2:51 PM Segher Boessenkool <
>>> segher@kernel.crashing.org> wrote:
>>>
>>>> On Mon, Nov 18, 2019 at 07:21:22PM +0000, Richard Earnshaw (lists) wrote:
>>>>> On 18/11/2019 18:53, Segher Boessenkool wrote:
>>>>>> PR target/92140: clang vs gcc optimizing with adc/sbb
>>>>>> PR fortran/91926: assumed rank optional
>>>>>> PR tree-optimization/91532: [SVE] Redundant predicated store in gcc.target/aarch64/fmla_2.c
>>>>>> PR tree-optimization/92161: ICE in vect_get_vec_def_for_stmt_copy, at tree-vect-stmts.c:1687
>>>>>> PR tree-optimization/92162: ICE in vect_create_epilog_for_reduction, at tree-vect-loop.c:4252
>>>>>> PR c++/92015: internal compiler error: in cxx_eval_array_reference, at cp/constexpr.c:2568
>>>>>> PR tree-optimization/92173: ICE in optab_for_tree_code, at optabs-tree.c:81
>>>>>> PR tree-optimization/92173: ICE in optab_for_tree_code, at optabs-tree.c:81
>>>>>> PR fortran/92174: runtime error: index 15 out of bounds for type 'gfc_expr *[15]
>>>>>>
>>>>>> Most of these aren't helpful at all, and none of these are good commit
>>>>>> summaries.  The PR92173 one actually has identical commit messages btw,
>>>>>> huh.  Ah, the second one (r277288) has the wrong changelog, but in the
>>>>>> actual changelog file as well, not something any tool could fix up (or
>>>>>> have we reached the singularity?)
>>>>>
>>>>> Identical commits are normally from where the same commit is made to
>>>>> multiple branches.  It's not uncommon to see this when bugs are fixed.
>>>>
>>>> This is an actual mistake.  The commits are not identical at all, just
>>>> the commit messages are (and the changelog entries, too).  Not something
>>>> that happens too often, but of course I hit it in the first random thing
>>>> I pick :-)
>>>>
>>>>> Ultimately the question here is whether something like the above is more
>>>>> or less useful than what we have today, which is summary lines of the form:
>>>>>
>>>>>
>>>>
>>>> I already said I would prefer things like
>>>> Patch related to PR323
>>>> as the patch subject lines.  No one argues that the current state of
>>>> affairs is good.  I argue that replacing this with often wrong and
>>>> irrelevant information isn't the best we can do.
>>>>
>>>
>>> How about using the first line that isn't a ChangeLog date/author line,
>>> without trying to rewrite/augment it?
>>>
>>> Jason
>>>
>>
>> It would certainly be another way of doing it.  Sometimes it would
>> produce almost the same as an unadulterated PR; sometimes it would
>> produce something more meaningful and sometimes it would be pretty
>> useless.  It probably would hit more cases than my current script in
>> that it wouldn't require the commit to mention a PR in it.
>>
>> The main problem is that the first line is often incomplete, and much of
>> it is also wasted with elements like the full path to a file that is
>> quite deep in the tree.  Let's take a quick example (the first I found in
>> the dump I have).
>>
>> 1998-12-17  Vladimir N. Makarov
>> * config/i60/i960.md (extendqihi2): Fix typo (usage ',' instead of
>> ';').
>> 1998-12-17  Michael Tiemann
>> * i960.md (extend*, zero_extend*): Don't generate rtl that looks
>> like (subreg:SI (reg:SI N) 0), because it's wrong, and it hides
>> optimizations from the combiner.
>>
>> Firstly, this example misses a blank line between the author and the
>> change message itself, which makes distinguishing between this and the
>> multiple authors case harder.
>> Secondly, the entry really spans two
>> lines and cutting it off at the end of the first line would be, well, a
>> bit odd.  We could try to piece things together more by combining lines
>> until we find a sentence end ( \.$ or \.\s\s ), and we could also strip
>> off more of the path components to just leave the name of the file
>> committed.  For example,
>>
>> i960.md (extendqihi2): Fix typo (usage ',' instead of ';').
>>
>> That might work better, but obviously it's harder to handle and thus
>> more likely to create erroneous summaries.
>>
>
> Yep.  I don't think we need to worry about getting optimal one-line
> summaries for ancient commits; something reasonably unique should be
> plenty.

Attached is the latest version of my script.  I used (very nearly) this
to produce a conversion over the weekend and I've uploaded that here:

  https://gitlab.com/rearnsha/gcc-trial-20191130

Note that I might blow this away at any time.  IT IS NOT A FINAL
CONVERSION.

Some other things to note:

- there are a number of known issues with the version of reposurgeon
  used for this that are being worked on
- emptycommit-* tags - my control script was out-of-date
- *deleted* branches - this is being worked on
- weird dependencies around merges - this is being worked on
- author attributions are sometimes incorrect - reported

The main difference between the attached script and the one I used for
this conversion is that ChangeLog changes that contain :: inside a
function list are now handled correctly, resulting in a number of cases
that were previously being missed now being correctly handled.

Choices I made:

- When a PR is used to derive the summary, I prefix this with 're' (as
  in the Latin 'in re').
- long change hunks produce poor summaries.  To reduce the overhead:
  - path names are removed, leaving just the final file name
  - multiple files are replaced with [...] after the first filename
  - similarly, multiple function names are replaced with [...]
  - very long comments are truncated, preferably at the strongest
    punctuation mark, but sometimes after key words, such as 'if',
    'when', 'unless' and a few more.  Ultimately, if the line is still
    too long, we just break after an arbitrary space.
- Where possible, useful summary lines that appear after an author
  attribution are hoisted as a summary.
- certain key words in otherwise not very useful summary lines are also
  spotted and used to add [revert] or [backport] annotations to the
  summary.

No changes are made to the main commit log; if we add a new summary
line, the entire original text is kept.

An example of a summary produced by this is for the commit to r278572,
where the original log entry is:

	Backported from mainline
	2019-08-02  Jakub Jelinek

		* quadmath.h (M_Eq, M_LOG2Eq, M_LOG10Eq, M_LN2q, M_LN10q, M_PIq,
		M_PI_2q, M_PI_4q, M_1_PIq, M_2_PIq, M_2_SQRTPIq, M_SQRT2q,
		M_SQRT1_2q): Use two more decimal places.

And the script then generates:

	[backport] quadmath.h (M_Eq, [...]): Use two more decimal places.

as the summary.

R.

--------------2B67D9E0B5C37962D655DDF4
Content-Type: text/x-python; name="bugdb.py"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment; filename="bugdb.py"
Content-length: 31670

#!/usr/bin/python2

import requests
import json
import os.path
import codecs
import email.message, email.parser
import re
import signal
import sys

# Files that we read or write.
buglist = "bugids.list"
bugdbfile = "bugs.json"
commits_in = "gcc.commitlog"
commits_out = "gcc.fixedprs"
commits_unfixed = "gcc.unfixed"

bugdb = {}

# Dictionary indexed by bugzilla component.  If the claimed PR component
# matches against the real component, then assume this is probably an OK
# match.  We avoid a lot of false positive warnings this way.
# A special match of "*" means we don't care in this case.  Many bugs
# are filed under, for example, bootstrap or other when in fact the
# real component will ultimately be shown in the commit log.
compalias = {
    "rtl-optimization": ["opt", "optimization", "optimisation", "optimize",
                         "middle-end", "rtl-opt", "rtl_optimization",
                         "rtl-optimizations", "rtl"],
    "tree-optimization": ["opt", "optimization", "optimisation", "optimize",
                          "middle-end", "tree-opt", "tree-optimzation",
                          "graphite", "lto", "tree-optimizations",
                          "tree-optimizer", "ipa", "tree-optimize",
                          "tree-optimizatio"],
    "java": ["gcj", "libgcj", "libjava"],
    "fastjar": ["java"],
    "libgcj": ["java", "libjava"],
    "classpath": ["libgcj", "libjava", "xml", "crypto"],
    "awt": ["classpath", "libjava"],
    "xml": ["classpath", "libjava"],
    "crypto": ["classpath", "libjava"],
    "objc++": ["obj-c++"],
    "c++": ["libstdc++", "diagnostic", "cp"],
    "c": ["diagnostic", "c-family"],
    "libstdc++": ["c++", "libstdc++-v3"],
    "libmudflap": ["mudflap"],
    "preprocessor": ["cpp", "pch"],
    "bootstrap": ["*"],
    "testsuite": ["*"],
    "regression": ["*"],
    "sanitizer": ["sanitize", "ubsan"],
    "other": ["*"],
    "middle-end": ["optimization", "rtl-optimization", "tree-optimization",
                   "tree-opt", "rtl-opt", "ipa", "lto", "middle",
                   "middle-ed"],
    "ipa": ["middle-end", "lto"],
    "lto": ["middle-end", "ipa", "tree-optimization"],
    "target": ["*"],
    "plugins": ["*"],
    "libf2c": ["fortran", "gfortran", "libgfortran", "libfortran"],
    "fortran": ["gfortran", "libgfortran", "libfortran", "forrtran"],
    "libfortran": ["fortran", "gfortran", "libgfortran", "libfofortran"],
    "libgomp": ["fortran"],
    "gcov-profile": ["profile", "gcov"],
    "libffi": ["other"],
    "web": ["documentation"]}
# (Several of the alias strings above, e.g. "tree-optimzation" and
# "forrtran", are deliberately misspelled: they match typos that appear
# in real commit logs.)

# List of Legacy-IDs where the PR is correct, despite the anomaly
whitelist = [
    "52862", "53125", "57958", "57959", "58705", "51884", "58535", "58542",
    "61329", "61770", "61960", "61984", "63783", "65127", "65128", "67653",
    "68236", "68239", "70062", "70542", "70625", "70932", "72072", "72259",
    "72260", "73504", "73940", "73940", "77602", "78312", "78313", "80793",
    "82490", "82518", "83609", "84217", "84295", "84531", "84532", "84887",
    "84895", "86396", "86429", "86458", "86459", "86460", "86461", "87066",
    "87189", "87511", "87512", "87632", "88922", "89175", "89921", "90004",
    "91308", "91504", "92158", "92585", "92658", "94054", "94226", "94317",
    "94805", "95382", "95383", "95428", "95673", "96396", "96626", "97373",
    "97375", "97537", "98412", "98926", "99480", "99514", "100206", "100207",
    "100209", "100371", "100851", "100961", "101264", "101273", "102155", "102348",
    "102539", "102567", "102799", "102920", "103268", "103279", "103308", "104948",
    "104949", "104988", "105052", "105274", "106269", "106582", "107816", "108045",
    "109166", "109167", "109168", "109169", "109643", "109644", "109818", "110131",
    "110132", "110627", "110975", "111818", "111947", "111952", "112296", "112298",
    "112611", "112612", "112819", "113095", "113119", "113121", "113137", "113627",
    "113628", "113661", "113662", "113701", "113702", "113703", "114482", "115200",
    "116613", "116633", "116633", "117824", "117852", "118025", "118034", "118142",
    "118742", "118743", "119248", "119485", "119486", "120497", "120498", "120499",
    "120523", "121334", "121338", "121949", "122640", "122641", "122700", "122743",
    "122751", "122798", "124362", "125025", "125108", "125844", "126675", "127812",
    "127688", "127689", "128272", "128573", "128649", "129871", "130016", "131781",
    "132577", "132587", "141646", "142724", "142725", "142833", "144976", "152550",
    "157199", "157784", "160456", "160476", "163777", "176083", "176167", "176623",
    "176624", "182943", "187842", "190100", "193620", "194102", "195610", "199972",
    "201377", "201378", "202642", "202937", "205753", "209250",
    "210999", "211000", "211001", "211002", "214734", "214958", "215345", "215386",
    "215804", "215806", "215807", "221207", "221958", "233033", "242543", "242795",
    "259105", "268434", "276571", "276987", "276993"]

# Indexed by Legacy-ID.  Each fixup may contain either the correct PR
# for this commit or an alternative summary string.  If the latter, this
# will form the entire summary for the commit.
fixups = {
    "15985": {"ignore": True},
    "37228": {"summary": "Use memset/memcmp instead of bzero/bcmp."},
    "52264": {"PR": "5373"},	# PR ok, but malformed.
    "57986": {"PR": "7792"},
    "52818": {"PR": "6479"},
    "58876": {"PR": "8481"},	# PR ok, but hard to parse.
    "65177": {"PR": "8808"},
    "65181": {"PR": "8808"},
    "68117": {"PR": "10712"},
    "68118": {"PR": "10712"},
    "71157": {"PR": "11867"},
    "71159": {"PR": "11867"},
    "71856": {"PR": "11415"},
    "75457": {"PR": "12815"},
    "81771": {"PR": "15204"},
    "82028": {"summary": "Multiple fixes: PRs 14692, 15274 and 15463"},
    "84508": {"PR": "16455"},
    "84522": {"PR": "15754"},
    "85966": {"PR": "16935"},
    "86431.1": {"summary": "Revert earlier fix for PR 14029"},
    "86431.2": {"summary": "Revert earlier fix for PR 14029"},
    "98287": {"PR": "17472"},
    "93620": {"PR": "19009"},
    "94658": {"PR": "19736"},
    "94889": {"PR": "19634"},
    "96560": {"PR": "20490"},
    "107451": {"PR": "24236"},
    "110063": {"summary":
        "PR25024, PR20881, PR23308, PR25538 and PR25710 - Procedure references"},
    "114667": {"PR": "27648"},
    "115858": {"PR": "28452"},
    "122127": {"PR": "30795"},
    "122980": {"PR": "31025"},
    "141612": {"PR": "32519"},
    "146134": {"PR": "38668"},
    "146593": {"PR": "39632"},
    "146650": {"PR": "39632"},
    "151112": {"PR": "28039"},
    "158393": {"PR": "43741"},
    "171924": {"PR": "48401"},
    "174228": {"PR": "46005"},
    "174235": {"PR": "46005"},
    "179230": {"PR": "45012"},
    "188077": {"PR": "52700"},
    "188078": {"PR": "52700"},
    "188079": {"PR": "52700"},
    "206073": {"PR": "35545"},
    "206074": {"PR": "35545"},
    "224527": {"PR": "56766"},
    "268301": {"PR": "89020"},
    "249711": {"summary":
        "Revert: backport PRs 80382, 80966"},
    "274921": {"summary":
        "[backport] Enable math functions linking with static library for LTO"},
    "276627": {"PR": "47054"}
}

full_author = re.compile(r"\s*((Mon|Tue|Wed|Thu|Fri|Sat|Sun)\s+(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s+\d{1,2}\s+\d{2}:\d{2}:\d{2}\s+([A-Z]{3}\s+)?\d{4}|\d{4}-\d{2}-\d{2})\s+.+\s[(<][\w.]+@[\w.-]+[>)]")

def read_bugdb(dbfile):
    try:
        with open(dbfile, "r") as f:
            bugdb.update(json.load(f))
            print(len(bugdb))
            f.close()
    except:
        return

def write_bugdb(dbfile):
    try:
        # Was "os.delete", which does not exist in the os module;
        # os.remove is the intended call (the bare except hid the
        # AttributeError, so the stale .bak file was never removed).
        os.remove(dbfile + ".bak")
    except:
        pass
    try:
        os.rename(dbfile, dbfile + ".bak")
    except:
        pass
    try:
        with open(dbfile, "w") as f:
            json.dump(bugdb, f)
            f.close()
    except:
        print("Failed to write " + dbfile)
        exit()

def fetch_sublist(sublist):
    url = 'https://gcc.gnu.org/bugzilla/rest.cgi/bug?id='
    url += ','.join([elem for elem in sublist])
    url += '&include_fields=id,summary,component,resolution'
    resp = requests.get(url)
    if resp.status_code != 200:
        print(resp.status_code)
    return resp.json()

def process(jdata):
    buglist = jdata['bugs']
    for bug in buglist:
        bugdb[str(bug['id'])] = {"summary": bug['summary'],
                                 "component": bug['component'],
                                 "resolution": bug['resolution']}
    print(len(bugdb))

def fetchall(bugids):
    f = open(bugids, "r")
    allbugids = f.read().splitlines()
    sublist = []
    for bugid in allbugids:
        if str(bugid) not in bugdb:
            #print("missing: " + id)
            sublist.append(bugid)
            if len(sublist) == 200:
                process(fetch_sublist(sublist))
                sublist = []
    for fixup in fixups.values():
        if "PR" in fixup and fixup['PR'] not in bugdb:
            sublist.append(fixup['PR'])
            if len(sublist) == 200:
                process(fetch_sublist(sublist))
                sublist = []
    if sublist:
        process(fetch_sublist(sublist))

# This is lifted from the python version of reposurgeon, since we are
# parsing objects it created.
# Since we assume python2
rspolicy = None

class RepoSurgeonEmail(email.message.Message, object):
    "Specialized email message with a distinguishing starter."
    Divider = 78 * "-"
    __hash__ = None

    def __init__(self, **kwargs):
        if rspolicy:
            # This activates the Python 3 email header handling hack
            kwargs['policy'] = rspolicy
        email.message.Message.__init__(self, **kwargs)
        self.set_unixfrom(RepoSurgeonEmail.Divider)

    @staticmethod
    def readmsg(fp):
        msg = ''
        firstline = fp.readline()
        if not firstline:
            return None
        elif not firstline.startswith(RepoSurgeonEmail.Divider):
            msg = firstline
        while True:
            line = fp.readline()
            if not line:
                break
            # Not impossible for this to get spoofed, but *highly* unlikely
            if line == "\n." + RepoSurgeonEmail.Divider + "\n":
                line = "\n" + RepoSurgeonEmail.Divider + "\n"
            if line.startswith(RepoSurgeonEmail.Divider):
                break
            msg += line
        return msg

    def payload(self):
        "Return message body - works because not multipart."
        return self.get_payload()

    def __str__(self):
        out = super(RepoSurgeonEmail, self).as_string(unixfrom=True)
        out = out.replace("\n" + RepoSurgeonEmail.Divider + "\n",
                          "\n." + RepoSurgeonEmail.Divider + "\n")
        out = out.replace("\n>From", "\nFrom")
        return out

# Close states that suggest that the bug is unlikely to be a correct
# target of a fix.
badclose = ["INVALID"]
# Would be nice to add ["WORKSFORME", "WONTFIX"], but too many false
# positives on early PRs.

# Fuzzy match a claimed bugcat against the component in the PR.  This avoids
# a number of likely false negatives without over relaxing the desire to
# detect possibly inaccurate PR numbers.
def component_mismatch(bugcat, bugid):
    # We're not bothered about case
    bugcat = bugcat.lower()
    component = bugdb[bugid]['component']
    # It's extremely unlikely that any bug classified as spam should match
    # a PR.  It's also pretty unlikely that a bug closed as INVALID would
    # do so either.  So flag up both cases.
    if (component == "spam"
            or bugdb[bugid]['resolution'] in badclose):
        return True
    # These blanket components lead to far too many false positives, so
    # just ignore them.  This is essentially the reverse map of the "*"
    # match later on.
    if (bugcat in ["boostrap", "bootstrap", "target", "plugins", "testsuite",
                   "other", "ice", "gcc", "translation", "regression",
                   "build", "diagnostic", "wrong-code"] or
            # These are nonsense categories, since they refer to
            # components outside of GCC.
            bugcat in ["binutils", "gas", "ld"] or
            bugcat == component):
        return False
    if component not in compalias:
        return True
    if ("*" in compalias[component]
            or bugcat in compalias[component]):
        return False
    return True

def summary_from_pr(msg, payload, lines):
    # Deal first with a first line that is a PR and a summary of its own;
    # we don't want to replace that.
    if (re.match(r"^\s*PR(\s+([^/\s]+)/|\s*#?)\d+\b", lines[0])
            and not re.match(r"^\s*PR(\s+([^/\s]+)/|\s*#?)\d+\s*[-:.]?\s*$",
                             lines[0])):
        return None
    y = re.search(r"(^|\s+|[/\(\[])PR(\s+([^/\s]+)/|\s*#?)(\d+)\b", payload)
    if not y:
        return None
    bugcat = y.group(3)
    bugid = y.group(4)
    if bugid not in bugdb:
        #print("PR " + bugid + " not found in database")
        return None
    summary = bugdb[bugid]['summary'].encode("utf-8")
    # Early on there were multiple GNATS databases for bugs;
    # later they were merged into a single database, but the
    # PR numbers were changed.  Don't trust bugs below 1000
    # unless the component matches the one mentioned in the
    # log (if it even has one).
    if (int(bugid) < 1000
            and (bugcat == None or bugcat != bugdb[bugid]['component'])):
        print("E" + msg['Event-Number'] + "(r" + msg['Legacy-ID']
              + "): Ignoring PR " + bugid
              + " - no component or component mismatch")
        return None
    elif (bugcat != None and msg['Legacy-ID'] not in whitelist
          and component_mismatch(bugcat, bugid)):
        problem = ("E" + msg['Event-Number'] + "(r" + msg['Legacy-ID']
                   + "): '" + bugcat + "/" + bugid + "'")
        if (bugdb[bugid]['resolution'] in badclose):
            problem += (" closed as "
                        + bugdb[bugid]['resolution'].encode("utf-8"))
            summary += (" [checkme: "
                        + bugdb[bugid]['resolution'].encode("utf-8")
                        + " SVN r" + msg['Legacy-ID'] + "]")
        else:
            problem += (" mismatch PR "
                        + bugdb[bugid]['component'].encode("utf-8"))
            summary += (" [checkme: " + bugcat
                        + " SVN r" + msg['Legacy-ID'] + "]")
        print(problem)
    z = re.match(r"\s*\[[^\]]*[Rr]egression\s*\]\s*:?\s*(.*)", summary)
    if z:
        #print(summary)
        summary = z.group(1)
    summary = ("re PR " + bugdb[bugid]['component'].encode("utf-8")
               + "/" + bugid + " (" + summary + ")")
    return summary

# Similar to strip_uninteresting_lines, but used when we have a single
# line that looks potentially interesting (because it begins with a
# letter and ends with a period) and we want to decide whether we really
# want it as the final summary.
def interesting_line(line):
    if re.match(r"^\s*([Ff]ix(e[ds])? (for )?)?PR(\s+[^/\s]+/|\s*#?)(\d+)[\.:]?\s*$",
                line):
        return False
    if re.match(r"\s*([Ff]ix|[Pp]atch) from", line):
        return False
    if re.match(r"\s*([Mm]erged?|[Bb]ack[-\s]?port(ed|s)?) (\w+ )?from ", line):
        return False
    if re.match(r"\s*[Ss]ynchronize with", line):
        return False
    if re.match(r"\s*[Ff]rom ", line):
        return False
    if re.match(r"\s*[Ii]n\s\S+$", line):
        return False
    if re.match(r"\s*(Reviewed|Signed-off|Acked|Reported)-by:", line):
        return False
    if re.match(r"\s*Reviewed-on:", line):
        return False
    return True

def strip_uninteresting_lines(lines):
    backport = False
    revert = False
    keepgoing = True
    # Strip any blank lines, or any line containing a single word, unless
    # that word is revert.  Mostly these make uninteresting summaries.
    # Also strip lines that begin "* From ...:"; this was an early way
    # of attributing a contributor who did not make a commit themselves.
    while lines:
        if re.match(r"^\s*$", lines[0]):
            pass
        elif full_author.match(lines[0]):
            # Strip any author lines
            while (len(lines) > 1
                   and re.match(r"\s+\w.*\s<[\w.]+@[\w.-]+>", lines[1])):
                lines = lines[1:]
        elif re.match(r"\s*PR\d+", lines[0]):
            keepgoing = False
        elif re.match(r"\s*\*\s*From\s.*:$", lines[0]):
            pass
        elif re.match(r"\W*([Rr]evert(ed)?|[Bb]ack[-\s]*out)(\.|.*:)?$", lines[0]):
            revert = True
        elif re.match(r"\W*([Bb]ack[-\s]*port(|s|ed)|[Mm]erged?)\b", lines[0]):
            backport = True
        elif re.match(r"\s*(\S+\s*|)$", lines[0]):
            pass
        else:
            keepgoing = False
        if not keepgoing:
            break
        lines = lines[1:]
    return lines, backport, revert

changemax = 80

def find_best_breakpoint(change):
    # Try, in turn, looking for ':', ';', ',' and ' - '.  The idea is
    # to find the strongest natural break within the message.
    # Avoid breaking at '):'.  This is more likely to be a change hunk that
    # was malformed in some way and produces poor breaks.  Similarly for
    # '::', which is often an OO function separator.
    x = re.match(r"^(.*?[^\)\]:]:)\s", change)
    if not x:
        x = re.match(r"^(.*?;)\s", change)
    if not x:
        x = re.match(r"^(.*?,)\s", change)
    if not x:
        x = re.match(r"^(.*?\s-(\s|$))\s", change)
    if x:
        return x.group(1)[0:len(x.group(1)) - 1]
    # Consider words that create subclauses; breaks after these are more
    # natural, but not if they are too early in the line.
    x = re.match(r"^(.*?\b(if|when|unless))(\s|$)",
                 change[changemax / 2:changemax])
    if x:
        return change[:(changemax / 2)] + x.group(1)
    x = re.match(r"^(.*\s)", change[:changemax])
    if x:
        return x.group(1)[0:len(x.group(1)) - 1]
    return change[:changemax]

badlines = [r"\s*$",
            r"\s*\*(\s|.*:(\s|$))",
            r"\s*\((\S+,\s+)*\S+\):",
            r"\s*/?(gcc|ada|testsuite|boehm-gc)(/[-\+\w])*/?:?$",
            r"\s*/?(brig|c|[cC]\+\+(filt)?|cfamily|ch)(/[-\+\w])*/?:?$",
            r"\s*/?(config|contrib|cp|doc|f|F77|fastjar)(/[-\+\w])*/?:?$",
            r"\s*/?([Ff]ortran|gcp|gfortran|gnattools)(/[-\+\w])*/?:?$",
            r"\s*/?(go|gotools|include|intl|java)(/[-\+\w])*/?:?$",
            r"\s*/?lib(ada|asan|atomic|backtrace|banshee)(/[-\+\w])*/?:?$",
            r"\s*/?lib(cc1|cpp|ffi|f2c|gcc|fortran|go|gomp)(/[-\+\w])*/?:?$",
            r"\s*/?lib(hsail-rt|iberty|io|itm|java|mudflap)(/[-\+\w])*/?:?$",
            r"\s*/?lib(objc|offloadmic|phobos|quadmath)(/[-\+\w])*/?:?$",
            r"\s*/?lib(sanitizer|ssp|stdc\+\+|vtv)(/[-\+\w])*/?:?$",
            r"\s*/?(lto|lto-plugin|maintainer-scripts|objc)(/[-\+\w])*/?:?$",
            r"\s*/?(src|zlib)(/[-\+\w])*/?:?$",
            r"\s\[gcc[^\]]*\]$",
            r"\(top level\)$"]

# Return the first sentence, and any remaining text.
def find_first_sentence(lines):
    # Never merge a PR number with additional lines
    if re.match(r"\s*PR(\s+[^/\s]+/|\s*#?)(\d+)", lines[0]):
        return lines[0], lines[1:]
    sentence = lines[0]
    x = re.match(r"\s*(.*?[\.\?!])(\s+(.*)|$)", sentence)
    if x:
        if x.group(3):
            return x.group(1), [x.group(3)] + lines[1:]
        return x.group(1), lines[1:]
    lines = lines[1:]
    while lines:
        for badline in badlines:
            if re.match(badline, lines[0]):
                return sentence, lines
        x = re.match(r"\s*(.*?\.)(\s+(.*)|$)", lines[0])
        if x:
            if x.group(3):
                return sentence + " " + x.group(1), [x.group(3)] + lines[1:]
            return sentence + " " + x.group(1), lines[1:]
        sentence += " " + re.match(r"\s*(\S.*)$", lines[0]).group(1)
        lines = lines[1:]
    return sentence, lines

def trim_excess_ws(mystr):
    # Strip any leading white space.
    mystr = re.sub(r"^\s+", "", mystr)
    # Canonicalize all white space into single space characters
    mystr = re.sub(r"\s{2,}", " ", mystr)
    # And any trailing white space
    mystr = re.sub(r"\s+$", "", mystr)
    return mystr

def simplify_funclist(funcstr):
    funcstr = trim_excess_ws(funcstr)
    # Sanity check
    if funcstr[0] != "(" or funcstr[-1] != ")":
        return funcstr
    funcstr = funcstr[1:-1]
    firstfunc = re.match(r"^([^\(]+?(\([^\)]*\))?),\s(.*)", funcstr)
    if firstfunc and len(firstfunc.group(3)) > 10:
        return " (" + firstfunc.group(1) + ", [...])"
    return " (" + funcstr + ")"

def simplify_filelist(filestr):
    filestr = trim_excess_ws(filestr)
    x = re.match(r"(\S+/)?(.+?)((,)?\s+(.*)|$)", filestr)
    if not x:
        return filestr
    sep = ", " if x.group(4) else " "
    if not x.group(5):
        return x.group(2)
    y = re.match(r"(\S+/)?(.+?)(,?\s+(.*)|$)", x.group(5))
    # If the first two 'files' do not contain a '.', then we might not have
    # a real file list.  Better to just return the original string.
    if (sep == " "
            and not (re.match(r"\S+\.\S", x.group(2))
                     or re.match(r"\S+\.\S", y.group(2)))):
        return filestr
    if y.group(4) or len(y.group(2)) > 10:
        return x.group(2) + sep + "[...]"
    return x.group(2) + sep + y.group(2)

def summary_from_change(msg, sentence):
    # Pattern to match a commit change entry in GNU ChangeLog style.  There
    # are some variants of this, but they are rare enough that we don't
    # worry too much about not handling them.
    # This RE is just too long to comprehend when written as a single
    # string...
    re_files = r"([-=\.\w\d_/\*\?\+,\s\[\]\{\}\$]+)"  # 1 group
    re_funcs = r"(\s*\(.*?\))?"                       # 1 group
    re_cases = r"\s*(\[[^\]]+\])?"                    # 1 group (unused)
    re_case2 = r"(<[^>]+>)?"                          # 1 group (unused)
    re_descr = r"\s*:(\s+(.*))?$"                     # 2 groups
    x = re.match(r"\s*\*\s*" + re_files + re_funcs + re_cases + re_case2
                 + re_descr, sentence)
    if not x:
        # A relatively common typing error is to leave out the closing
        # ')' from a function list.  Try again for that option.  We
        # use a simpler match pattern for this case, leaving out the
        # cases variants.
        x = re.match(r"\s*\*\s*" + re_files + r"(\s*\([^\)]*?)" + re_descr,
                     sentence)
        if not x:
            print("Failed to parse: " + sentence)
            return None
        filename = simplify_filelist(x.group(1))
        funcs = simplify_funclist(x.group(2) + ")")
        change = x.group(4)
    else:
        filename = simplify_filelist(x.group(1))
        funcs = simplify_funclist(x.group(2)) if x.group(2) else ""
        change = x.group(6) if x.group(6) else ""
    # Just use the basename for any filename.
    if len(change) < changemax + 5:
        return trim_excess_ws(filename + funcs + ": " + change)
    while (len(change) > changemax):
        change = find_best_breakpoint(change)
    change += "..."
    return trim_excess_ws(filename + funcs + ": " + change)

def find_best_summary(msg, lines):
    sentence, _ = find_first_sentence(lines)
    # ChangeLog entries look like "* file[ (func-list)]: change."
    if re.match(r"\s*\*\s*.+:\s", sentence):
        return summary_from_change(msg, sentence)
    return None

def already_good_summary(lines):
    while lines and lines[0] == "":
        lines = lines[1:]
    # No non-blank lines.  Nothing we can do to make this good, so best
    # to stop now.
    if not lines:
        return True
    # Anything created by cvs2svn should be left alone
    if re.search(r"by cvs2svn", lines[0]):
        return True
    # Single word lines, committer tags or simple PR tags don't make a
    # good summary.  (The last pattern originally read r"s*\S+\s*$"; the
    # missing backslash meant indented single-word lines were not caught.)
    if (full_author.match(lines[0])
            or re.match(r"\s*PR(\s+[^/\s]+/|\s*#?)(\d+)\s*$", lines[0])
            or re.match(r"\s*\S+\s*$", lines[0])):
        return False
    # We likely can't make a single-line commit message better.
    if len(lines) == 1:
        return True
    if re.match(r"\s*\*", lines[0]):
        return False
    # A 'real' sentence, i.e. more than one word and ending with a
    # punctuation mark.  We don't need a blank line in this case as
    # reposurgeon's gitify will handle this for us.
    if re.match(r"\s*[-\w\[\(\{\]\)\}_,.:\']\S*\s+\S.*[.:;,?!]$", lines[0]):
        return interesting_line(lines[0])
    if not interesting_line(lines[0]):
        return False
    # If the second line is blank, treat this as good.
    if re.match(r"\s*$", lines[1]):
        return True
    # It might be a single-line summary that is wrapped onto more than
    # one line.  Handle this case as it's unlikely we can improve it.
    # If the line looks overly long, then we may want to produce a shorter
    # version, so punt.
    sentence, remnant = find_first_sentence(lines)
    if (len(sentence) < 120 and remnant
            and re.match(r"\s*$", remnant[0])):
        return True
    return False

def summary_from_sentence(sentence, remnant):
    # Ignore anything that looks like a change hunk.
    if re.match(r"\s*\*", sentence):
        return None
    if not interesting_line(sentence):
        return None
    sentence = trim_excess_ws(sentence)
    length = len(sentence)
    if length < 120 and not remnant:
        return None
    elif length < 120:
        return sentence
    return find_best_breakpoint(sentence) + "..."
def processmsg(msg):
    payload = msg.get_payload()
    summary = None
    # Ignore events tagged with "emptycommit-*"; these will be deleted
    # later on.
    if 'Tag-Name' in msg and re.match(r"^emptycommit-", msg['Tag-Name']):
        return None
    # Look to see if this event has a specific fixup.  Use it if so.
    if msg['Legacy-ID'] in fixups:
        fixup = fixups[msg['Legacy-ID']]
        if 'ignore' in fixup:
            return None
        if "PR" in fixup:
            summary = bugdb[fixup['PR']]['summary'].encode("utf-8")
            z = re.match(r"\s*\[[^\]]*[Rr]egression\s*\]\s*:?\s*(.*)", summary)
            if z:
                summary = z.group(1)
            summary = ("re PR " + bugdb[fixup['PR']]['component'].encode("utf-8")
                       + "/" + fixup['PR'] + " (" + summary + ")")
        elif "summary" in fixup:
            summary = fixup['summary']
        else:
            print("Invalid fixup for Legacy-ID " + msg['Legacy-ID'])
            exit()
    lines = payload.splitlines()
    # Don't mess with something that looks like a good summary.
    if not summary and already_good_summary(lines):
        return None
    while lines and re.match(r"\s*$", lines[-1]):
        lines = lines[:-1]
    # Look to see if this event is a backport of multiple PRs.  Just
    # set the summary to a quick list if so.
    if not summary:
        allprs = re.findall(r"\b\s*PR(\s+([^/\s]+)/|\s*#?)(\d+)\b", payload)
        allprnos = list(dict.fromkeys([int(pr[2]) for pr in allprs]))
        allprnos.sort()
        if ((len(allprnos) > 3
             and re.search(r"\b[Bb]ack[- ]?port(s|ed)?\b", payload))
            or (len(allprnos) > 1
                and re.search(r"\b[Bb]ack[- ]?ports", payload))):
            prs = allprnos if len(allprnos) < 10 else allprnos[:10]
            #print prs[0]
            summary = "Backport PRs " + str(prs[0])
            for pr in prs[1:]:
                summary += ", " + str(pr)
            if len(allprnos) > 10:
                summary += " and more"
    backport = False
    revert = False
    # Initialised here so the exception handler at the end of this
    # function cannot hit a NameError when no earlier path sets it.
    bugid = ""
    if not summary:
        while lines and lines[0] == "":
            lines = lines[1:]
        if not lines:
            return None
        multicommit = False
        x = full_author.match(lines[0])
        author_line = lines[0]
        if x:
            if len(lines) > 2:
                is_backport = re.search(r"\b[Bb]ack[- ]?port", payload)
                for line in lines[2:]:
                    if full_author.match(line):
                        # If this looks like a backport, allow one
                        # mismatch, since often we have the author of
                        # the original commit and the author of the
                        # backport; it's rare to get a full match in
                        # that case.
                        if is_backport and author_line != line:
                            is_backport = None
                            author_line = line
                        elif author_line != line and lines[0] != line:
                            multicommit = True
                            break
            if multicommit:
                summary = "[multiple changes]"
                bugid = ""
            if author_line != lines[0]:
                backport = True
    if not summary:
        lines, backport, revert = strip_uninteresting_lines(lines)
        if not lines:
            return None
        sentence, remnant = find_first_sentence(lines)
        summary = summary_from_sentence(sentence, remnant)
    if not summary:
        summary = summary_from_pr(msg, payload, lines)
    if not summary:
        summary = find_best_summary(msg, lines)
    if not summary:
        return None
    preamble = ""
    if revert:
        preamble = "[revert]"
    if backport:
        preamble += "[backport]"
    if preamble != "":
        if summary[0] == "[":
            summary = preamble + summary
        else:
            summary = preamble + " " + summary
    newmsg = RepoSurgeonEmail()
    # All messages should have an event number, so this will hopefully
    # raise an exception if it fails
    newmsg['Event-Number'] = msg['Event-Number']
    if "Check-Text" in msg:
        newmsg['Check-Text'] = msg['Check-Text']
    # We must copy the tag name if it exists.
    if "Tag-Name" in msg:
        newmsg['Tag-Name'] = msg['Tag-Name']
    try:
        newmsg.set_payload(summary + "\n\n" + payload)
        return str(newmsg)
    except:
        print("Unicode??? Cannot format message for Event "
              + msg['Event-Number']
              + (summary if bugid == "" else (" PR " + bugid)))
        return None

def main():
    scrape_bugzilla = True
    for arg in sys.argv[1:]:
        if arg == "--noscrape":
            scrape_bugzilla = False
        else:
            print("Unknown option: " + arg)
            exit(1)
    read_bugdb(bugdbfile)
    if scrape_bugzilla:
        try:
            fetchall(buglist)
        except KeyboardInterrupt:
            write_bugdb(bugdbfile)
            exit()
        write_bugdb(bugdbfile)
    commitlog = open(commits_in, "r")
    newlog = open(commits_out, "w")
    unfixed = open(commits_unfixed, "w")
    while True:
        msg = RepoSurgeonEmail.readmsg(commitlog)
        if not msg:
            break
        newmsg = processmsg(email.message_from_string(msg))
        if newmsg:
            newlog.write(newmsg)
        else:
            unfixed.write(msg)
    newlog.close()
    commitlog.close()
    unfixed.close()

if __name__ == '__main__':
    main()

--------------2B67D9E0B5C37962D655DDF4--
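P.S. For anyone who wants to play with the truncation heuristic described above without pulling in the whole script, here is a minimal standalone sketch of the idea (Python 3).  This is a simplification, not the code from the attachment: it implements only the punctuation-preference part (':' first, then ';', then ',', then a plain space), and the name best_breakpoint and the 80-column default merely mirror the script's find_best_breakpoint and changemax.

```python
CHANGEMAX = 80  # mirrors the script's changemax limit for summary lines

def best_breakpoint(change, changemax=CHANGEMAX):
    """Truncate an over-long change description at the strongest natural
    break: ':' first, then ';', then ',', falling back to the last space
    before the limit.  Simplified sketch, not the script's exact code."""
    if len(change) <= changemax:
        return change
    head = change[:changemax]
    # Prefer the strongest punctuation mark followed by a space.
    for punct in (": ", "; ", ", "):
        idx = head.rfind(punct)
        if idx > 0:
            # Keep the punctuation mark itself, drop the trailing space.
            return change[:idx + 1]
    # No suitable punctuation: break after the last space instead.
    idx = head.rfind(" ")
    if idx > 0:
        return change[:idx]
    return head
```

The real script layers more on top of this: breaks after subclause keywords such as 'if', 'when' and 'unless', repeated truncation until the line fits, and the trailing "..." marker.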