From mboxrd@z Thu Jan 1 00:00:00 1970
Subject: Re: Commit messages and the move to git
To: Jason Merrill
Cc: Segher Boessenkool, Eric Raymond, Jeff Law, GCC Development
References: <20191107142727.GA72444@thyrsus.com> <20191109060151.GA82270@thyrsus.com>
 <78a57894-74f2-94d5-65a1-3ce2bd934f32@arm.com> <20191118155549.GH16031@gate.crashing.org>
 <20191118171115.GI16031@gate.crashing.org> <20191118185328.GJ16031@gate.crashing.org>
 <20191118194450.GL16031@gate.crashing.org>
From: "Richard Earnshaw (lists)"
Message-ID: <632176b4-0ff6-1389-1576-138700ea900b@arm.com>
Date: Mon, 02 Dec 2019 10:54:00 -0000
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="------------2B67D9E0B5C37962D655DDF4"
X-SW-Source: 2019-12/txt/msg00003.txt.bz2

This is a multi-part message in MIME format.
--------------2B67D9E0B5C37962D655DDF4
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Content-length: 6943

On 19/11/2019 14:56, Jason Merrill wrote:
> On Mon, Nov 18, 2019 at 4:38 PM Richard Earnshaw (lists) <
> Richard.Earnshaw@arm.com> wrote:
>
>> On 18/11/2019 20:53, Jason Merrill wrote:
>>> On Mon, Nov 18, 2019 at 2:51 PM Segher Boessenkool <
>>> segher@kernel.crashing.org> wrote:
>>>
>>>> On Mon, Nov 18, 2019 at 07:21:22PM +0000, Richard Earnshaw (lists) wrote:
>>>>> On 18/11/2019 18:53, Segher Boessenkool wrote:
>>>>>> PR target/92140: clang vs gcc optimizing with adc/sbb
>>>>>> PR fortran/91926: assumed rank optional
>>>>>> PR tree-optimization/91532: [SVE] Redundant predicated store in gcc.target/aarch64/fmla_2.c
>>>>>> PR tree-optimization/92161: ICE in vect_get_vec_def_for_stmt_copy, at tree-vect-stmts.c:1687
>>>>>> PR tree-optimization/92162: ICE in vect_create_epilog_for_reduction, at tree-vect-loop.c:4252
>>>>>> PR c++/92015: internal compiler error: in cxx_eval_array_reference, at cp/constexpr.c:2568
>>>>>> PR tree-optimization/92173: ICE in optab_for_tree_code, at optabs-tree.c:81
>>>>>> PR tree-optimization/92173: ICE in optab_for_tree_code, at optabs-tree.c:81
>>>>>> PR fortran/92174: runtime error: index 15 out of bounds for type 'gfc_expr *[15]
>>>>>>
>>>>>> Most of these aren't helpful at all, and none of these are good commit
>>>>>> summaries.  The PR92173 one actually has identical commit messages btw,
>>>>>> huh.  Ah, the second one (r277288) has the wrong changelog, but in the
>>>>>> actual changelog file as well, not something any tool could fix up (or
>>>>>> have we reached the singularity?)
>>>>>
>>>>> Identical commits are normally from where the same commit is made to
>>>>> multiple branches.  It's not uncommon to see this when bugs are fixed.
>>>>
>>>> This is an actual mistake.  The commits are not identical at all, just
>>>> the commit messages are (and the changelog entries, too).  Not something
>>>> that happens too often, but of course I hit it in the first random thing
>>>> I pick :-)
>>>>
>>>>> Ultimately the question here is whether something like the above is more
>>>>> or less useful than what we have today, which is summary lines of the form:
>>>>>
>>>>>
>>>>
>>>> I already said I would prefer things like
>>>> Patch related to PR323
>>>> as the patch subject lines.  No one argues that the current state of
>>>> affairs is good.  I argue that replacing this with often wrong and
>>>> irrelevant information isn't the best we can do.
>>>>
>>>
>>> How about using the first line that isn't a ChangeLog date/author line,
>>> without trying to rewrite/augment it?
>>>
>>> Jason
>>>
>>
>> It would certainly be another way of doing it.  Sometimes it would
>> produce almost the same as an unadulterated PR; sometimes it would
>> produce something more meaningful and sometimes it would be pretty
>> useless.  It probably would hit more cases than my current script in
>> that it wouldn't require the commit to mention a PR in it.
>>
>> The main problem is that the first line is often incomplete, and much of
>> it is also wasted with elements like the full path to a file that is
>> quite deep in the tree.  Let's take a quick example (the first I found in
>> the dump I have).
>>
>> 1998-12-17  Vladimir N. Makarov
>> * config/i60/i960.md (extendqihi2): Fix typo (usage ',' instead of
>> ';').
>> 1998-12-17  Michael Tiemann
>> * i960.md (extend*, zero_extend*): Don't generate rtl that looks
>> like (subreg:SI (reg:SI N) 0), because it's wrong, and it hides
>> optimizations from the combiner.
>>
>> Firstly, this example misses a blank line between the author and the
>> change message itself, which makes distinguishing between this and the
>> multiple authors case harder.
>> Secondly, the entry really spans two
>> lines and cutting it off at the end of the first line would be, well, a
>> bit odd.  We could try to piece things together more by combining lines
>> until we find a sentence end ( \.$ or \.\s\s ), and we could also strip
>> off more of the path components to just leave the name of the file
>> committed.  For example,
>>
>> i960.md (extendqihi2): Fix typo (usage ',' instead of ';').
>>
>> That might work better, but obviously it's harder to handle and thus
>> more likely to create erroneous summaries.
>>
>
> Yep.  I don't think we need to worry about getting optimal one-line
> summaries for ancient commits; something reasonably unique should be
> plenty.

Attached is the latest version of my script.  I used (very nearly) this
to produce a conversion over the weekend and I've uploaded that here:

  https://gitlab.com/rearnsha/gcc-trial-20191130

Note that I might blow this away at any time.  IT IS NOT A FINAL
CONVERSION.

Some other things to note:

- there are a number of known issues with the version of reposurgeon
  used for this that are being worked on
- emptycommit-* tags - my control script was out-of-date
- *deleted* branches - this is being worked on
- weird dependencies around merges - this is being worked on
- author attributions are sometimes incorrect - reported

The main difference between the attached script and the one I used for
this conversion is that ChangeLog changes that contain :: inside a
function list are now handled correctly, resulting in a number of cases
that were previously being missed now being correctly handled.

Choices I made:

- When a PR is used to derive the summary, I prefix this with 're' (as
  in the Latin 'in re').
- long change hunks produce poor summaries.  To reduce the overhead:
  - path names are removed, leaving just the final file name
  - multiple files are replaced with [...] after the first filename
  - similarly, multiple function names are replaced with [...]
  - very long comments are truncated, preferably at the strongest
    punctuation mark, but sometimes after key words, such as 'if',
    'when', 'unless' and a few more.  Ultimately, if the line is still
    too long, we just break after an arbitrary space.
- Where possible, useful summary lines that appear after an author
  attribution are hoisted as a summary.
- certain key words in otherwise not very useful summary lines are also
  spotted and used to add [revert] or [backport] annotations to the
  summary.

No changes are made to the main commit log; if we add a new summary
line, the entire original text is kept.

An example of a summary produced by this is for the commit to r278572,
where the original log entry is:

	Backported from mainline
	2019-08-02  Jakub Jelinek

		* quadmath.h (M_Eq, M_LOG2Eq, M_LOG10Eq, M_LN2q, M_LN10q, M_PIq,
		M_PI_2q, M_PI_4q, M_1_PIq, M_2_PIq, M_2_SQRTPIq, M_SQRT2q,
		M_SQRT1_2q): Use two more decimal places.

And the script then generates:

	[backport] quadmath.h (M_Eq, [...]): Use two more decimal places.

as the summary.

R.

--------------2B67D9E0B5C37962D655DDF4
Content-Type: text/x-python; name="bugdb.py"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment; filename="bugdb.py"
Content-length: 31670

#!/usr/bin/python2

import requests
import json
import os.path
import codecs
import email.message, email.parser
import re
import signal
import sys

# Files that we read or write.
buglist = "bugids.list"
bugdbfile = "bugs.json"
commits_in = "gcc.commitlog"
commits_out = "gcc.fixedprs"
commits_unfixed = "gcc.unfixed"

bugdb = {}

# Dictionary indexed by bugzilla component.  If the claimed PR component
# matches against the real component, then assume this is probably an OK
# match.  We avoid a lot of false positive warnings this way.
# A special match of "*" means we don't care in this case.  Many bugs
# are filed under, for example, bootstrap or other when in fact the
# real component will ultimately be shown in the commit log.
compalias = {
    "rtl-optimization": ["opt", "optimization", "optimisation", "optimize",
                         "middle-end", "rtl-opt", "rtl_optimization",
                         "rtl-optimizations", "rtl"],
    "tree-optimization": ["opt", "optimization", "optimisation", "optimize",
                          "middle-end", "tree-opt", "tree-optimzation",
                          "graphite", "lto", "tree-optimizations",
                          "tree-optimizer", "ipa", "tree-optimize",
                          "tree-optimizatio"],
    "java": ["gcj", "libgcj", "libjava"],
    "fastjar": ["java"],
    "libgcj": ["java", "libjava"],
    "classpath": ["libgcj", "libjava", "xml", "crypto"],
    "awt": ["classpath", "libjava"],
    "xml": ["classpath", "libjava"],
    "crypto": ["classpath", "libjava"],
    "objc++": ["obj-c++"],
    "c++": ["libstdc++", "diagnostic", "cp"],
    "c": ["diagnostic", "c-family"],
    "libstdc++": ["c++", "libstdc++-v3"],
    "libmudflap": ["mudflap"],
    "preprocessor": ["cpp", "pch"],
    "bootstrap": ["*"],
    "testsuite": ["*"],
    "regression": ["*"],
    "sanitizer": ["sanitize", "ubsan"],
    "other": ["*"],
    "middle-end": ["optimization", "rtl-optimization", "tree-optimization",
                   "tree-opt", "rtl-opt", "ipa", "lto", "middle",
                   "middle-ed"],
    "ipa": ["middle-end", "lto"],
    "lto": ["middle-end", "ipa", "tree-optimization"],
    "target": ["*"],
    "plugins": ["*"],
    "libf2c": ["fortran", "gfortran", "libgfortran", "libfortran"],
    "fortran": ["gfortran", "libgfortran", "libfortran", "forrtran"],
    "libfortran": ["fortran", "gfortran", "libgfortran", "libfofortran"],
    "libgomp": ["fortran"],
    "gcov-profile": ["profile", "gcov"],
    "libffi": ["other"],
    "web": ["documentation"]}
# (Several of the alias strings above, e.g. "tree-optimzation" and
# "forrtran", are deliberately misspelled: they match typos that appear
# in real commit logs.)

# List of Legacy-IDs where the PR is correct, despite the anomaly
whitelist = [
    "52862", "53125", "57958", "57959", "58705", "51884", "58535", "58542",
    "61329", "61770", "61960", "61984", "63783", "65127", "65128", "67653",
    "68236", "68239", "70062", "70542", "70625", "70932", "72072", "72259",
    "72260", "73504", "73940", "73940", "77602", "78312", "78313", "80793",
    "82490", "82518", "83609", "84217", "84295", "84531", "84532", "84887",
    "84895", "86396", "86429", "86458", "86459", "86460", "86461", "87066",
    "87189", "87511", "87512", "87632", "88922", "89175", "89921", "90004",
    "91308", "91504", "92158", "92585", "92658", "94054", "94226", "94317",
    "94805", "95382", "95383", "95428", "95673", "96396", "96626", "97373",
    "97375", "97537", "98412", "98926", "99480", "99514", "100206", "100207",
    "100209", "100371", "100851", "100961", "101264", "101273", "102155", "102348",
    "102539", "102567", "102799", "102920", "103268", "103279", "103308", "104948",
    "104949", "104988", "105052", "105274", "106269", "106582", "107816", "108045",
    "109166", "109167", "109168", "109169", "109643", "109644", "109818", "110131",
    "110132", "110627", "110975", "111818", "111947", "111952", "112296", "112298",
    "112611", "112612", "112819", "113095", "113119", "113121", "113137", "113627",
    "113628", "113661", "113662", "113701", "113702", "113703", "114482", "115200",
    "116613", "116633", "116633", "117824", "117852", "118025", "118034", "118142",
    "118742", "118743", "119248", "119485", "119486", "120497", "120498", "120499",
    "120523", "121334", "121338", "121949", "122640", "122641", "122700", "122743",
    "122751", "122798", "124362", "125025", "125108", "125844", "126675", "127812",
    "127688", "127689", "128272", "128573", "128649", "129871", "130016", "131781",
    "132577", "132587", "141646", "142724", "142725", "142833", "144976", "152550",
    "157199", "157784", "160456", "160476", "163777", "176083", "176167", "176623",
    "176624", "182943", "187842", "190100", "193620", "194102", "195610", "199972",
    "201377", "201378", "202642", "202937", "205753", "209250",
    "210999", "211000", "211001", "211002", "214734", "214958", "215345", "215386",
    "215804", "215806", "215807", "221207", "221958", "233033", "242543", "242795",
    "259105", "268434", "276571", "276987", "276993"]

# Indexed by Legacy-ID.  Each fixup may contain either the correct PR
# for this commit or an alternative summary string.  If the latter, this
# will form the entire summary for the commit.
fixups = {
    "15985": {"ignore": True},
    "37228": {"summary": "Use memset/memcmp instead of bzero/bcmp."},
    "52264": {"PR": "5373"},	# PR ok, but malformed.
    "57986": {"PR": "7792"},
    "52818": {"PR": "6479"},
    "58876": {"PR": "8481"},	# PR ok, but hard to parse.
    "65177": {"PR": "8808"},
    "65181": {"PR": "8808"},
    "68117": {"PR": "10712"},
    "68118": {"PR": "10712"},
    "71157": {"PR": "11867"},
    "71159": {"PR": "11867"},
    "71856": {"PR": "11415"},
    "75457": {"PR": "12815"},
    "81771": {"PR": "15204"},
    "82028": {"summary": "Multiple fixes: PRs 14692, 15274 and 15463"},
    "84508": {"PR": "16455"},
    "84522": {"PR": "15754"},
    "85966": {"PR": "16935"},
    "86431.1": {"summary": "Revert earlier fix for PR 14029"},
    "86431.2": {"summary": "Revert earlier fix for PR 14029"},
    "98287": {"PR": "17472"},
    "93620": {"PR": "19009"},
    "94658": {"PR": "19736"},
    "94889": {"PR": "19634"},
    "96560": {"PR": "20490"},
    "107451": {"PR": "24236"},
    "110063": {"summary":
        "PR25024, PR20881, PR23308, PR25538 and PR25710 - Procedure references"},
    "114667": {"PR": "27648"},
    "115858": {"PR": "28452"},
    "122127": {"PR": "30795"},
    "122980": {"PR": "31025"},
    "141612": {"PR": "32519"},
    "146134": {"PR": "38668"},
    "146593": {"PR": "39632"},
    "146650": {"PR": "39632"},
    "151112": {"PR": "28039"},
    "158393": {"PR": "43741"},
    "171924": {"PR": "48401"},
    "174228": {"PR": "46005"},
    "174235": {"PR": "46005"},
    "179230": {"PR": "45012"},
    "188077": {"PR": "52700"},
    "188078": {"PR": "52700"},
    "188079": {"PR": "52700"},
    "206073": {"PR": "35545"},
    "206074": {"PR": "35545"},
    "224527": {"PR": "56766"},
    "268301": {"PR": "89020"},
    "249711": {"summary":
        "Revert: backport PRs 80382, 80966"},
    "274921": {"summary":
        "[backport] Enable math functions linking with static library for LTO"},
    "276627": {"PR": "47054"}
}

full_author = re.compile(r"\s*((Mon|Tue|Wed|Thu|Fri|Sat|Sun)\s+(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s+\d{1,2}\s+\d{2}:\d{2}:\d{2}\s+([A-Z]{3}\s+)?\d{4}|\d{4}-\d{2}-\d{2})\s+.+\s[(<][\w.]+@[\w.-]+[>)]")

def read_bugdb(dbfile):
    try:
        with open(dbfile, "r") as f:
            bugdb.update(json.load(f))
            print(len(bugdb))
            f.close()
    except:
        return

def write_bugdb(dbfile):
    try:
        # Was "os.delete", which does not exist in the os module;
        # os.remove is the intended call (the bare except hid the
        # AttributeError, so the stale .bak file was never removed).
        os.remove(dbfile + ".bak")
    except:
        pass
    try:
        os.rename(dbfile, dbfile + ".bak")
    except:
        pass
    try:
        with open(dbfile, "w") as f:
            json.dump(bugdb, f)
            f.close()
    except:
        print("Failed to write " + dbfile)
        exit()

def fetch_sublist(sublist):
    url = 'https://gcc.gnu.org/bugzilla/rest.cgi/bug?id='
    url += ','.join([elem for elem in sublist])
    url += '&include_fields=id,summary,component,resolution'
    resp = requests.get(url)
    if resp.status_code != 200:
        print(resp.status_code)
    return resp.json()

def process(jdata):
    buglist = jdata['bugs']
    for bug in buglist:
        bugdb[str(bug['id'])] = {"summary": bug['summary'],
                                 "component": bug['component'],
                                 "resolution": bug['resolution']}
    print(len(bugdb))

def fetchall(bugids):
    f = open(bugids, "r")
    allbugids = f.read().splitlines()
    sublist = []
    for bugid in allbugids:
        if str(bugid) not in bugdb:
            #print("missing: " + id)
            sublist.append(bugid)
            if len(sublist) == 200:
                process(fetch_sublist(sublist))
                sublist = []
    for fixup in fixups.values():
        if "PR" in fixup and fixup['PR'] not in bugdb:
            sublist.append(fixup['PR'])
            if len(sublist) == 200:
                process(fetch_sublist(sublist))
                sublist = []
    if sublist:
        process(fetch_sublist(sublist))

# This is lifted from the python version of reposurgeon, since we are
# parsing objects it created.
# Since we assume python2
rspolicy = None

class RepoSurgeonEmail(email.message.Message, object):
    "Specialized email message with a distinguishing starter."
    Divider = 78 * "-"
    __hash__ = None

    def __init__(self, **kwargs):
        if rspolicy:
            # This activates the Python 3 email header handling hack
            kwargs['policy'] = rspolicy
        email.message.Message.__init__(self, **kwargs)
        self.set_unixfrom(RepoSurgeonEmail.Divider)

    @staticmethod
    def readmsg(fp):
        msg = ''
        firstline = fp.readline()
        if not firstline:
            return None
        elif not firstline.startswith(RepoSurgeonEmail.Divider):
            msg = firstline
        while True:
            line = fp.readline()
            if not line:
                break
            # Not impossible for this to get spoofed, but *highly* unlikely
            if line == "\n." + RepoSurgeonEmail.Divider + "\n":
                line = "\n" + RepoSurgeonEmail.Divider + "\n"
            if line.startswith(RepoSurgeonEmail.Divider):
                break
            msg += line
        return msg

    def payload(self):
        "Return message body - works because not multipart."
        return self.get_payload()

    def __str__(self):
        out = super(RepoSurgeonEmail, self).as_string(unixfrom=True)
        out = out.replace("\n" + RepoSurgeonEmail.Divider + "\n",
                          "\n." + RepoSurgeonEmail.Divider + "\n")
        out = out.replace("\n>From", "\nFrom")
        return out

# Close states that suggest that the bug is unlikely to be a correct
# target of a fix.
badclose = ["INVALID"]
# Would be nice to add ["WORKSFORME", "WONTFIX"], but too many false
# positives on early PRs.

# Fuzzy match a claimed bugcat against the component in the PR.  This avoids
# a number of likely false negatives without over relaxing the desire to
# detect possibly inaccurate PR numbers.
def component_mismatch(bugcat, bugid):
    # We're not bothered about case
    bugcat = bugcat.lower()
    component = bugdb[bugid]['component']
    # It's extremely unlikely that any bug classified as spam should match
    # a PR.  It's also pretty unlikely that a bug closed as INVALID would
    # do so either.  So flag up both cases.
    if (component == "spam"
            or bugdb[bugid]['resolution'] in badclose):
        return True
    # These blanket components lead to far too many false positives, so
    # just ignore them.  This is essentially the reverse map of the "*"
    # match later on.
    if (bugcat in ["boostrap", "bootstrap", "target", "plugins", "testsuite",
                   "other", "ice", "gcc", "translation", "regression",
                   "build", "diagnostic", "wrong-code"] or
            # These are nonsense categories, since they refer to
            # components outside of GCC.
            bugcat in ["binutils", "gas", "ld"] or
            bugcat == component):
        return False
    if component not in compalias:
        return True
    if ("*" in compalias[component]
            or bugcat in compalias[component]):
        return False
    return True

def summary_from_pr(msg, payload, lines):
    # Deal first with a first line that is a PR and a summary of its own;
    # we don't want to replace that.
    if (re.match(r"^\s*PR(\s+([^/\s]+)/|\s*#?)\d+\b", lines[0])
            and not re.match(r"^\s*PR(\s+([^/\s]+)/|\s*#?)\d+\s*[-:.]?\s*$",
                             lines[0])):
        return None
    y = re.search(r"(^|\s+|[/\(\[])PR(\s+([^/\s]+)/|\s*#?)(\d+)\b", payload)
    if not y:
        return None
    bugcat = y.group(3)
    bugid = y.group(4)
    if bugid not in bugdb:
        #print("PR " + bugid + " not found in database")
        return None
    summary = bugdb[bugid]['summary'].encode("utf-8")
    # Early on there were multiple GNATS databases for bugs;
    # later they were merged into a single database, but the
    # PR numbers were changed.  Don't trust bugs below 1000
    # unless the component matches the one mentioned in the
    # log (if it even has one).
    if (int(bugid) < 1000
            and (bugcat == None or bugcat != bugdb[bugid]['component'])):
        print("E" + msg['Event-Number'] + "(r" + msg['Legacy-ID']
              + "): Ignoring PR " + bugid
              + " - no component or component mismatch")
        return None
    elif (bugcat != None and msg['Legacy-ID'] not in whitelist
          and component_mismatch(bugcat, bugid)):
        problem = ("E" + msg['Event-Number'] + "(r" + msg['Legacy-ID']
                   + "): '" + bugcat + "/" + bugid + "'")
        if (bugdb[bugid]['resolution'] in badclose):
            problem += (" closed as "
                        + bugdb[bugid]['resolution'].encode("utf-8"))
            summary += (" [checkme: "
                        + bugdb[bugid]['resolution'].encode("utf-8")
                        + " SVN r" + msg['Legacy-ID'] + "]")
        else:
            problem += (" mismatch PR "
                        + bugdb[bugid]['component'].encode("utf-8"))
            summary += (" [checkme: " + bugcat
                        + " SVN r" + msg['Legacy-ID'] + "]")
        print(problem)
    z = re.match(r"\s*\[[^\]]*[Rr]egression\s*\]\s*:?\s*(.*)", summary)
    if z:
        #print(summary)
        summary = z.group(1)
    summary = ("re PR " + bugdb[bugid]['component'].encode("utf-8")
               + "/" + bugid + " (" + summary + ")")
    return summary

# Similar to strip_uninteresting_lines, but used when we have a single
# line that looks potentially interesting (because it begins with a
# letter and ends with a period) and we want to decide whether we really
# want it as the final summary.
def interesting_line(line):
    if re.match(r"^\s*([Ff]ix(e[ds])? (for )?)?PR(\s+[^/\s]+/|\s*#?)(\d+)[\.:]?\s*$",
                line):
        return False
    if re.match(r"\s*([Ff]ix|[Pp]atch) from", line):
        return False
    if re.match(r"\s*([Mm]erged?|[Bb]ack[-\s]?port(ed|s)?) (\w+ )?from ", line):
        return False
    if re.match(r"\s*[Ss]ynchronize with", line):
        return False
    if re.match(r"\s*[Ff]rom ", line):
        return False
    if re.match(r"\s*[Ii]n\s\S+$", line):
        return False
    if re.match(r"\s*(Reviewed|Signed-off|Acked|Reported)-by:", line):
        return False
    if re.match(r"\s*Reviewed-on:", line):
        return False
    return True

def strip_uninteresting_lines(lines):
    backport = False
    revert = False
    keepgoing = True
    # Strip any blank lines, or any line containing a single word, unless
    # that word is revert.  Mostly these make uninteresting summaries.
    # Also strip lines that begin "* From ...:"; this was an early way
    # of attributing a contributor who did not make a commit themselves.
    while lines:
        if re.match(r"^\s*$", lines[0]):
            pass
        elif full_author.match(lines[0]):
            # Strip any author lines
            while (len(lines) > 1
                   and re.match(r"\s+\w.*\s<[\w.]+@[\w.-]+>", lines[1])):
                lines = lines[1:]
        elif re.match(r"\s*PR\d+", lines[0]):
            keepgoing = False
        elif re.match(r"\s*\*\s*From\s.*:$", lines[0]):
            pass
        elif re.match(r"\W*([Rr]evert(ed)?|[Bb]ack[-\s]*out)(\.|.*:)?$", lines[0]):
            revert = True
        elif re.match(r"\W*([Bb]ack[-\s]*port(|s|ed)|[Mm]erged?)\b", lines[0]):
            backport = True
        elif re.match(r"\s*(\S+\s*|)$", lines[0]):
            pass
        else:
            keepgoing = False
        if not keepgoing:
            break
        lines = lines[1:]
    return lines, backport, revert

changemax = 80

def find_best_breakpoint(change):
    # Try, in turn, looking for ':', ';', ',' and ' - '.  The idea is
    # to find the strongest natural break within the message.
    # Avoid breaking at '):'.  This is more likely to be a change hunk that
    # was malformed in some way and produces poor breaks.  Similarly for
    # '::', which is often an OO function separator.
    x = re.match(r"^(.*?[^\)\]:]:)\s", change)
    if not x:
        x = re.match(r"^(.*?;)\s", change)
    if not x:
        x = re.match(r"^(.*?,)\s", change)
    if not x:
        x = re.match(r"^(.*?\s-(\s|$))\s", change)
    if x:
        return x.group(1)[0:len(x.group(1)) - 1]
    # Consider words that create subclauses; breaks after these are more
    # natural, but not if they are too early in the line.
    x = re.match(r"^(.*?\b(if|when|unless))(\s|$)",
                 change[changemax / 2:changemax])
    if x:
        return change[:(changemax / 2)] + x.group(1)
    x = re.match(r"^(.*\s)", change[:changemax])
    if x:
        return x.group(1)[0:len(x.group(1)) - 1]
    return change[:changemax]

badlines = [r"\s*$",
            r"\s*\*(\s|.*:(\s|$))",
            r"\s*\((\S+,\s+)*\S+\):",
            r"\s*/?(gcc|ada|testsuite|boehm-gc)(/[-\+\w])*/?:?$",
            r"\s*/?(brig|c|[cC]\+\+(filt)?|cfamily|ch)(/[-\+\w])*/?:?$",
            r"\s*/?(config|contrib|cp|doc|f|F77|fastjar)(/[-\+\w])*/?:?$",
            r"\s*/?([Ff]ortran|gcp|gfortran|gnattools)(/[-\+\w])*/?:?$",
            r"\s*/?(go|gotools|include|intl|java)(/[-\+\w])*/?:?$",
            r"\s*/?lib(ada|asan|atomic|backtrace|banshee)(/[-\+\w])*/?:?$",
            r"\s*/?lib(cc1|cpp|ffi|f2c|gcc|fortran|go|gomp)(/[-\+\w])*/?:?$",
            r"\s*/?lib(hsail-rt|iberty|io|itm|java|mudflap)(/[-\+\w])*/?:?$",
            r"\s*/?lib(objc|offloadmic|phobos|quadmath)(/[-\+\w])*/?:?$",
            r"\s*/?lib(sanitizer|ssp|stdc\+\+|vtv)(/[-\+\w])*/?:?$",
            r"\s*/?(lto|lto-plugin|maintainer-scripts|objc)(/[-\+\w])*/?:?$",
            r"\s*/?(src|zlib)(/[-\+\w])*/?:?$",
            r"\s\[gcc[^\]]*\]$",
            r"\(top level\)$"]

# Return the first sentence, and any remaining text.
def find_first_sentence(lines):
    # Never merge a PR number with additional lines
    if re.match(r"\s*PR(\s+[^/\s]+/|\s*#?)(\d+)", lines[0]):
        return lines[0], lines[1:]
    sentence = lines[0]
    x = re.match(r"\s*(.*?[\.\?!])(\s+(.*)|$)", sentence)
    if x:
        if x.group(3):
            return x.group(1), [x.group(3)] + lines[1:]
        return x.group(1), lines[1:]
    lines = lines[1:]
    while lines:
        for badline in badlines:
            if re.match(badline, lines[0]):
                return sentence, lines
        x = re.match(r"\s*(.*?\.)(\s+(.*)|$)", lines[0])
        if x:
            if x.group(3):
                return sentence + " " + x.group(1), [x.group(3)] + lines[1:]
            return sentence + " " + x.group(1), lines[1:]
        sentence += " " + re.match(r"\s*(\S.*)$", lines[0]).group(1)
        lines = lines[1:]
    return sentence, lines

def trim_excess_ws(mystr):
    # Strip any leading white space.
    mystr = re.sub(r"^\s+", "", mystr)
    # Canonicalize all white space into single space characters
    mystr = re.sub(r"\s{2,}", " ", mystr)
    # And any trailing white space
    mystr = re.sub(r"\s+$", "", mystr)
    return mystr

def simplify_funclist(funcstr):
    funcstr = trim_excess_ws(funcstr)
    # Sanity check
    if funcstr[0] != "(" or funcstr[-1] != ")":
        return funcstr
    funcstr = funcstr[1:-1]
    firstfunc = re.match(r"^([^\(]+?(\([^\)]*\))?),\s(.*)", funcstr)
    if firstfunc and len(firstfunc.group(3)) > 10:
        return " (" + firstfunc.group(1) + ", [...])"
    return " (" + funcstr + ")"

def simplify_filelist(filestr):
    filestr = trim_excess_ws(filestr)
    x = re.match(r"(\S+/)?(.+?)((,)?\s+(.*)|$)", filestr)
    if not x:
        return filestr
    sep = ", " if x.group(4) else " "
    if not x.group(5):
        return x.group(2)
    y = re.match(r"(\S+/)?(.+?)(,?\s+(.*)|$)", x.group(5))
    # If the first two 'files' do not contain a '.', then we might not have
    # a real file list.  Better to just return the original string.
    if (sep == " "
            and not (re.match(r"\S+\.\S", x.group(2))
                     or re.match(r"\S+\.\S", y.group(2)))):
        return filestr
    if y.group(4) or len(y.group(2)) > 10:
        return x.group(2) + sep + "[...]"
    return x.group(2) + sep + y.group(2)

def summary_from_change(msg, sentence):
    # Pattern to match a commit change entry in GNU ChangeLog style.  There
    # are some variants of this, but they are rare enough that we don't
    # worry too much about not handling them.
    # This RE is just too long to comprehend when written as a single
    # string...
    re_files = r"([-=\.\w\d_/\*\?\+,\s\[\]\{\}\$]+)"  # 1 group
    re_funcs = r"(\s*\(.*?\))?"                       # 1 group
    re_cases = r"\s*(\[[^\]]+\])?"                    # 1 group (unused)
    re_case2 = r"(<[^>]+>)?"                          # 1 group (unused)
    re_descr = r"\s*:(\s+(.*))?$"                     # 2 groups
    x = re.match(r"\s*\*\s*" + re_files + re_funcs + re_cases + re_case2
                 + re_descr, sentence)
    if not x:
        # A relatively common typing error is to leave out the closing
        # ')' from a function list.  Try again for that option.  We
        # use a simpler match pattern for this case, leaving out the
        # cases variants.
        x = re.match(r"\s*\*\s*" + re_files + r"(\s*\([^\)]*?)" + re_descr,
                     sentence)
        if not x:
            print("Failed to parse: " + sentence)
            return None
        filename = simplify_filelist(x.group(1))
        funcs = simplify_funclist(x.group(2) + ")")
        change = x.group(4)
    else:
        filename = simplify_filelist(x.group(1))
        funcs = simplify_funclist(x.group(2)) if x.group(2) else ""
        change = x.group(6) if x.group(6) else ""
    # Just use the basename for any filename.
    if len(change) < changemax + 5:
        return trim_excess_ws(filename + funcs + ": " + change)
    while (len(change) > changemax):
        change = find_best_breakpoint(change)
    change += "..."
    return trim_excess_ws(filename + funcs + ": " + change)

def find_best_summary(msg, lines):
    sentence, _ = find_first_sentence(lines)
    # ChangeLog entries look like "* file[ (func-list)]: change."
    if re.match(r"\s*\*\s*.+:\s", sentence):
        return summary_from_change(msg, sentence)
    return None

def already_good_summary(lines):
    while lines and lines[0] == "":
        lines = lines[1:]
    # No non-blank lines.  Nothing we can do to make this good, so best
    # to stop now.
    if not lines:
        return True
    # Anything created by cvs2svn should be left alone
    if re.search(r"by cvs2svn", lines[0]):
        return True
    # Single word lines, committer tags or simple PR tags don't make a
    # good summary.  (The last pattern originally read r"s*\S+\s*$"; the
    # missing backslash meant indented single-word lines were not caught.)
    if (full_author.match(lines[0])
            or re.match(r"\s*PR(\s+[^/\s]+/|\s*#?)(\d+)\s*$", lines[0])
            or re.match(r"\s*\S+\s*$", lines[0])):
        return False
    # We likely can't make a single-line commit message better.
    if len(lines) == 1:
        return True
    if re.match(r"\s*\*", lines[0]):
        return False
    # A 'real' sentence, i.e. more than one word and ending with a
    # punctuation mark.  We don't need a blank line in this case as
    # reposurgeon's gitify will handle this for us.
    if re.match(r"\s*[-\w\[\(\{\]\)\}_,.:\']\S*\s+\S.*[.:;,?!]$", lines[0]):
        return interesting_line(lines[0])
    if not interesting_line(lines[0]):
        return False
    # If the second line is blank, treat this as good.
    if re.match(r"\s*$", lines[1]):
        return True
    # It might be a single-line summary that is wrapped onto more than
    # one line.  Handle this case as it's unlikely we can improve it.
    # If the line looks overly long, then we may want to produce a shorter
    # version, so punt.
    sentence, remnant = find_first_sentence(lines)
    if (len(sentence) < 120 and remnant
            and re.match(r"\s*$", remnant[0])):
        return True
    return False

def summary_from_sentence(sentence, remnant):
    # Ignore anything that looks like a change hunk.
    if re.match(r"\s*\*", sentence):
        return None
    if not interesting_line(sentence):
        return None
    sentence = trim_excess_ws(sentence)
    length = len(sentence)
    if length < 120 and not remnant:
        return None
    elif length < 120:
        return sentence
    return find_best_breakpoint(sentence) + "..."
def processmsg(msg):
    payload = msg.get_payload()
    summary = None
    # Ignore events tagged with "emptycommit-*"; these will be deleted
    # later on.
    if 'Tag-Name' in msg and re.match(r"^emptycommit-", msg['Tag-Name']):
        return None
    # Look to see if this event has a specific fixup.  Use it if so.
    if msg['Legacy-ID'] in fixups:
        fixup = fixups[msg['Legacy-ID']]
        if 'ignore' in fixup:
            return None
        if "PR" in fixup:
            summary = bugdb[fixup['PR']]['summary'].encode("utf-8")
            z = re.match(r"\s*\[[^\]]*[Rr]egression\s*\]\s*:?\s*(.*)", summary)
            if z:
                summary = z.group(1)
            summary = ("re PR " + bugdb[fixup['PR']]['component'].encode("utf-8")
                       + "/" + fixup['PR'] + " (" + summary + ")")
        elif "summary" in fixup:
            summary = fixup['summary']
        else:
            print("Invalid fixup for Legacy-ID " + msg['Legacy-ID'])
            exit()
    lines = payload.splitlines()
    # Don't mess with something that looks like a good summary.
    if not summary and already_good_summary(lines):
        return None
    while lines and re.match(r"\s*$", lines[-1]):
        lines = lines[:-1]
    # Look to see if this event is a backport of multiple PRs.  Just
    # set the summary to a quick list if so.
    if not summary:
        allprs = re.findall(r"\b\s*PR(\s+([^/\s]+)/|\s*#?)(\d+)\b", payload)
        allprnos = list(dict.fromkeys([int(pr[2]) for pr in allprs]))
        allprnos.sort()
        if ((len(allprnos) > 3
             and re.search(r"\b[Bb]ack[- ]?port(s|ed)?\b", payload))
            or (len(allprnos) > 1
                and re.search(r"\b[Bb]ack[- ]?ports", payload))):
            prs = allprnos if len(allprnos) < 10 else allprnos[:10]
            #print prs[0]
            summary = "Backport PRs " + str(prs[0])
            for pr in prs[1:]:
                summary += ", " + str(pr)
            if len(allprnos) > 10:
                summary += " and more"
    backport = False
    revert = False
    # Initialised here so the exception handler at the end of this
    # function cannot hit a NameError when no earlier path sets it.
    bugid = ""
    if not summary:
        while lines and lines[0] == "":
            lines = lines[1:]
        if not lines:
            return None
        multicommit = False
        x = full_author.match(lines[0])
        author_line = lines[0]
        if x:
            if len(lines) > 2:
                is_backport = re.search(r"\b[Bb]ack[- ]?port", payload)
                for line in lines[2:]:
                    if full_author.match(line):
                        # If this looks like a backport, allow one
                        # mismatch, since often we have the author of
                        # the original commit and the author of the
                        # backport; it's rare to get a full match in
                        # that case.
                        if is_backport and author_line != line:
                            is_backport = None
                            author_line = line
                        elif author_line != line and lines[0] != line:
                            multicommit = True
                            break
            if multicommit:
                summary = "[multiple changes]"
                bugid = ""
            if author_line != lines[0]:
                backport = True
    if not summary:
        lines, backport, revert = strip_uninteresting_lines(lines)
        if not lines:
            return None
        sentence, remnant = find_first_sentence(lines)
        summary = summary_from_sentence(sentence, remnant)
    if not summary:
        summary = summary_from_pr(msg, payload, lines)
    if not summary:
        summary = find_best_summary(msg, lines)
    if not summary:
        return None
    preamble = ""
    if revert:
        preamble = "[revert]"
    if backport:
        preamble += "[backport]"
    if preamble != "":
        if summary[0] == "[":
            summary = preamble + summary
        else:
            summary = preamble + " " + summary
    newmsg = RepoSurgeonEmail()
    # All messages should have an event number, so this will hopefully
    # raise an exception if it fails
    newmsg['Event-Number'] = msg['Event-Number']
    if "Check-Text" in msg:
        newmsg['Check-Text'] = msg['Check-Text']
    # We must copy the tag name if it exists.
    if "Tag-Name" in msg:
        newmsg['Tag-Name'] = msg['Tag-Name']
    try:
        newmsg.set_payload(summary + "\n\n" + payload)
        return str(newmsg)
    except:
        print("Unicode??? Cannot format message for Event "
              + msg['Event-Number']
              + (summary if bugid == "" else (" PR " + bugid)))
        return None

def main():
    scrape_bugzilla = True
    for arg in sys.argv[1:]:
        if arg == "--noscrape":
            scrape_bugzilla = False
        else:
            print("Unknown option: " + arg)
            exit(1)
    read_bugdb(bugdbfile)
    if scrape_bugzilla:
        try:
            fetchall(buglist)
        except KeyboardInterrupt:
            write_bugdb(bugdbfile)
            exit()
        write_bugdb(bugdbfile)
    commitlog = open(commits_in, "r")
    newlog = open(commits_out, "w")
    unfixed = open(commits_unfixed, "w")
    while True:
        msg = RepoSurgeonEmail.readmsg(commitlog)
        if not msg:
            break
        newmsg = processmsg(email.message_from_string(msg))
        if newmsg:
            newlog.write(newmsg)
        else:
            unfixed.write(msg)
    newlog.close()
    commitlog.close()
    unfixed.close()

if __name__ == '__main__':
    main()

--------------2B67D9E0B5C37962D655DDF4--
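P.S. For anyone who wants to play with the truncation heuristic described above without pulling in the whole script, here is a minimal standalone sketch of the idea (Python 3).  This is a simplification, not the code from the attachment: it implements only the punctuation-preference part (':' first, then ';', then ',', then a plain space), and the name best_breakpoint and the 80-column default merely mirror the script's find_best_breakpoint and changemax.

```python
CHANGEMAX = 80  # mirrors the script's changemax limit for summary lines

def best_breakpoint(change, changemax=CHANGEMAX):
    """Truncate an over-long change description at the strongest natural
    break: ':' first, then ';', then ',', falling back to the last space
    before the limit.  Simplified sketch, not the script's exact code."""
    if len(change) <= changemax:
        return change
    head = change[:changemax]
    # Prefer the strongest punctuation mark followed by a space.
    for punct in (": ", "; ", ", "):
        idx = head.rfind(punct)
        if idx > 0:
            # Keep the punctuation mark itself, drop the trailing space.
            return change[:idx + 1]
    # No suitable punctuation: break after the last space instead.
    idx = head.rfind(" ")
    if idx > 0:
        return change[:idx]
    return head
```

The real script layers more on top of this: breaks after subclause keywords such as 'if', 'when' and 'unless', repeated truncation until the line fits, and the trailing "..." marker.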