public inbox for gcc@gcc.gnu.org
* Re: Test GCC conversion with reposurgeon available
@ 2020-01-06 22:09 Loren James Rittle
  2020-01-07  9:35 ` Richard Earnshaw (lists)
  0 siblings, 1 reply; 54+ messages in thread
From: Loren James Rittle @ 2020-01-06 22:09 UTC (permalink / raw)
  To: gcc

On Fri, 3 Jan 2020, Joseph Myers wrote:

> git+ssh://gcc.gnu.org/home/gccadmin/gcc-reposurgeon-7a.git
> git+ssh://gcc.gnu.org/home/gccadmin/gcc-reposurgeon-7b.git

I have not had a substantial commit to gcc [or, likely, post to this
list] in a decade THUS a warm howdy to anyone still around from
1999-2009.  Every "git log" entry for my commits looked fine.  There
were two odd cases, but both now make sense to me.  My first commit's
Author line (1999) contains a physical machine name (but it correctly
matches the contemporary Changelog entry).  The second odd case took
more time to understand:

In gcc-reposurgeon-7a, commit hash 16fc918929 ; I corrected a prior
Changelog (for 89903bf801).  I would not expect any version control
system translation process to look ahead for such textual changes to a
side-maintained ChangeLog file but I was somewhat stumped on why the
ChangeLog text did not match the "git log" text.

Regards,
Loren


* Re: Test GCC conversion with reposurgeon available
  2020-01-06 22:09 Test GCC conversion with reposurgeon available Loren James Rittle
@ 2020-01-07  9:35 ` Richard Earnshaw (lists)
  2020-01-07 15:53   ` Loren James Rittle
  0 siblings, 1 reply; 54+ messages in thread
From: Richard Earnshaw (lists) @ 2020-01-07  9:35 UTC (permalink / raw)
  To: Loren James Rittle, gcc

On 06/01/2020 22:09, Loren James Rittle wrote:
> On Fri, 3 Jan 2020, Joseph Myers wrote:
> 
>> git+ssh://gcc.gnu.org/home/gccadmin/gcc-reposurgeon-7a.git
>> git+ssh://gcc.gnu.org/home/gccadmin/gcc-reposurgeon-7b.git
> 
> I have not had a substantial commit to gcc [or, likely, post to this
> list] in a decade THUS a warm howdy to anyone still around from
> 1999-2009.  Every "git log" entry for my commits looked fine.  There
> were two odd cases, but both now make sense to me.  My first commit's
> Author line (1999) contains a physical machine name (but it correctly
> matches the contemporary Changelog entry).  The second odd case took
> more time to understand:
> 
> In gcc-reposurgeon-7a, commit hash 16fc918929 ; I corrected a prior
> Changelog (for 89903bf801).  I would not expect any version control
> system translation process to look ahead for such textual changes to a
> side-maintained ChangeLog file but I was somewhat stumped on why the
> ChangeLog text did not match the "git log" text.
> 
> Regards,
> Loren
> 

We can add a correction so that the git 'Author' field will be fixed, if 
you would like; but the ChangeLog and commit message will remain as is.

R.


* Re: Test GCC conversion with reposurgeon available
  2020-01-07  9:35 ` Richard Earnshaw (lists)
@ 2020-01-07 15:53   ` Loren James Rittle
  0 siblings, 0 replies; 54+ messages in thread
From: Loren James Rittle @ 2020-01-07 15:53 UTC (permalink / raw)
  To: Richard Earnshaw (lists); +Cc: gcc

Richard,

Thanks for the offer, but no need.  Just wanted to confirm with some
detail that I reviewed aspects of the svn-git conversion and LGTM.

BTW, I too saw the issue (in 14 out of 261 master commits) reported by
Andrew where (in my case) "ljrittle@gcc.gnu.org" was used in Author
line(s) rather than the expected e-mail.  In every one of my cases, it
appears to be because no exact ChangeLog text was related to the commit,
rather than because of case mismatches as in Andrew's analysis.

Regards,
Loren

On Tue, Jan 7, 2020 at 3:41 AM Richard Earnshaw (lists)
<Richard.Earnshaw@arm.com> wrote:
>
> On 06/01/2020 22:09, Loren James Rittle wrote:
> > On Fri, 3 Jan 2020, Joseph Myers wrote:
> >
> >> git+ssh://gcc.gnu.org/home/gccadmin/gcc-reposurgeon-7a.git
> >> git+ssh://gcc.gnu.org/home/gccadmin/gcc-reposurgeon-7b.git
> >
> > I have not had a substantial commit to gcc [or, likely, post to this
> > list] in a decade THUS a warm howdy to anyone still around from
> > 1999-2009.  Every "git log" entry for my commits looked fine.  There
> > were two odd cases, but both now make sense to me.  My first commit's
> > Author line (1999) contains a physical machine name (but it correctly
> > matches the contemporary Changelog entry).  The second odd case took
> > more time to understand:
> >
> > In gcc-reposurgeon-7a, commit hash 16fc918929 ; I corrected a prior
> > Changelog (for 89903bf801).  I would not expect any version control
> > system translation process to look ahead for such textual changes to a
> > side-maintained ChangeLog file but I was somewhat stumped on why the
> > ChangeLog text did not match the "git log" text.
> >
> > Regards,
> > Loren
> >
>
> We can add a correction so that the git 'Author' field will be fixed, if
> you would like; but the ChangeLog and commit message will remain as is.
>
> R.


* Re: Test GCC conversion with reposurgeon available
  2020-01-09 12:22           ` Joseph Myers
@ 2020-01-09 21:57             ` Joseph Myers
  0 siblings, 0 replies; 54+ messages in thread
From: Joseph Myers @ 2020-01-09 21:57 UTC (permalink / raw)
  To: gcc; +Cc: esr

On Thu, 9 Jan 2020, Joseph Myers wrote:

> Here's a test conversion with the conversion machinery in what should be 
> essentially final form.  This is like the "b" versions (dead and vendor 
> branches present but not fetched by default), with the addition of refs 
> from the existing git mirror as refs/git-old/* and refs/git-svn-old/* (not 
> fetched by default).
> 
> git+ssh://gcc.gnu.org/home/gccadmin/gcc-reposurgeon-8.git

Hooks are now set up and ready for testing commits to this repository, 
including integration with gcc-cvs and libstdc++-cvs mailing lists and 
Bugzilla.  I recommend only referencing test bugs you open for the purpose 
in commits, not real bugs, to avoid confusing people, as the messages that 
end up in Bugzilla don't make it obvious that this is a test conversion 
(the messages on the -cvs mailing lists are more obvious, as they say 
"gcc-reposurgeon-8" in their subject headers).  Commits made in this 
repository will not end up in the real conversion.  gitweb URLs in 
messages from this conversion won't actually work, because they point to 
the real gitweb (which currently points to the git-svn mirror).

All commits should generate commit emails.  Only commits to master and 
release branches should generate Bugzilla updates.  master and release 
branches do not allow merge commits, other branches do.  Branches in 
refs/users/<user>/heads and refs/vendors/<vendor>/heads allow 
non-fast-forward pushes and branch deletion, other branches don't.
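
The rules in the paragraph above can be read as a small decision table. The following is only a sketch of that reading, not the AdaCore hooks or their configuration: the function names are made up for illustration, and the `refs/heads/releases/` prefix for release branches is an assumption that the message does not state.

```python
import re

# Hypothetical ref patterns for illustration; the real rules live in the
# hooks' project.config. The release-branch prefix is an assumption.
RELEASE_RE = re.compile(r"^refs/heads/releases/")
PERSONAL_RE = re.compile(r"^refs/(users|vendors)/[^/]+/heads/")


def is_protected(ref):
    """True for master and (assumed) release branches."""
    return ref == "refs/heads/master" or bool(RELEASE_RE.match(ref))


def check_push(ref, has_merge_commit=False, fast_forward=True, deletion=False):
    """Return (allowed, sends_bugzilla_update) for a proposed ref update."""
    personal = bool(PERSONAL_RE.match(ref))
    # master and release branches do not allow merge commits.
    if is_protected(ref) and has_merge_commit:
        return (False, False)
    # Only user and vendor branches allow rewinds and deletion.
    if (deletion or not fast_forward) and not personal:
        return (False, False)
    # Only commits to master and release branches update Bugzilla.
    return (True, is_protected(ref))
```

For example, `check_push("refs/users/jdoe/heads/wip", fast_forward=False)` is allowed under this reading, while the same forced push to `refs/heads/devel/foo` is not.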

Branch updates or new branches based on the history from the git-svn 
mirror (with 3cf0d8938a953ef13e57239613d42686f152b4fe, the initial git-svn 
commit, in their ancestry) are disallowed; this avoids someone 
accidentally pushing such a branch to a namespace that git fetches by 
default and causing everyone to fetch a GB of extra history as a result.  
Thus, for continued development based on such a branch you should start by 
rebasing (not merging) onto the new version of the history, and then the 
rebased branch can be pushed to one of the supported namespaces (under 
refs/users/<user>/heads/ if treating it as a user branch, under 
refs/heads/devel/ for a development branch that gets fetched by default).
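
Before pushing, a local branch can be checked for the old git-svn ancestry described above. This is a minimal sketch, assuming it runs inside a clone that contains the old mirror history; the function name and the injectable `run` parameter are illustrative, and the server-side hook may well implement the check differently.

```python
import subprocess

# Initial git-svn commit named in the message above.
OLD_MIRROR_ROOT = "3cf0d8938a953ef13e57239613d42686f152b4fe"


def based_on_old_mirror(tip, run=subprocess.run):
    """Return True if OLD_MIRROR_ROOT is an ancestor of `tip`.

    `git merge-base --is-ancestor A B` exits 0 when A is an ancestor of B
    and 1 when it is not; any other status indicates an error.
    """
    result = run(["git", "merge-base", "--is-ancestor", OLD_MIRROR_ROOT, tip])
    if result.returncode not in (0, 1):
        raise RuntimeError(
            "git merge-base failed with status %d" % result.returncode
        )
    return result.returncode == 0
```

In a clone that does not contain that commit at all, git reports an error rather than "not an ancestor", so the sketch raises there instead of returning False.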

The hook configuration is something that seemed reasonable as a starting 
point, not necessarily what we will have as the final configuration.  The 
configuration file for the AdaCore hooks is project.config on the 
refs/meta/config branch (but there are local patches to those hooks at 
present, and as with other refs, changes to refs/meta/config will not 
persist to the final converted repository).

-- 
Joseph S. Myers
joseph@codesourcery.com


* Re: Test GCC conversion with reposurgeon available
  2020-01-03 12:38         ` Joseph Myers
  2020-01-06 23:58           ` Andrew Pinski
@ 2020-01-09 12:22           ` Joseph Myers
  2020-01-09 21:57             ` Joseph Myers
  1 sibling, 1 reply; 54+ messages in thread
From: Joseph Myers @ 2020-01-09 12:22 UTC (permalink / raw)
  To: gcc; +Cc: esr

On Fri, 3 Jan 2020, Joseph Myers wrote:

> On Sat, 28 Dec 2019, Joseph Myers wrote:
> 
> > Two more.
> > 
> > git+ssh://gcc.gnu.org/home/gccadmin/gcc-reposurgeon-6a.git
> > git+ssh://gcc.gnu.org/home/gccadmin/gcc-reposurgeon-6b.git
> 
> Two more.
> 
> git+ssh://gcc.gnu.org/home/gccadmin/gcc-reposurgeon-7a.git
> git+ssh://gcc.gnu.org/home/gccadmin/gcc-reposurgeon-7b.git

Here's a test conversion with the conversion machinery in what should be 
essentially final form.  This is like the "b" versions (dead and vendor 
branches present but not fetched by default), with the addition of refs 
from the existing git mirror as refs/git-old/* and refs/git-svn-old/* (not 
fetched by default).

git+ssh://gcc.gnu.org/home/gccadmin/gcc-reposurgeon-8.git

-- 
Joseph S. Myers
joseph@codesourcery.com


* Re: Test GCC conversion with reposurgeon available
  2020-01-06 23:58           ` Andrew Pinski
  2020-01-07  0:30             ` Joseph Myers
@ 2020-01-07  0:44             ` Richard Earnshaw
  1 sibling, 0 replies; 54+ messages in thread
From: Richard Earnshaw @ 2020-01-07  0:44 UTC (permalink / raw)
  To: Andrew Pinski, Joseph Myers; +Cc: GCC Mailing List, Eric Raymond

On 06/01/2020 23:57, Andrew Pinski wrote:
> On Fri, Jan 3, 2020 at 4:38 AM Joseph Myers <joseph@codesourcery.com> wrote:
>>
>> On Sat, 28 Dec 2019, Joseph Myers wrote:
>>
>>> Two more.
>>>
>>> git+ssh://gcc.gnu.org/home/gccadmin/gcc-reposurgeon-6a.git
>>> git+ssh://gcc.gnu.org/home/gccadmin/gcc-reposurgeon-6b.git
>>
>> Two more.
>>
>> git+ssh://gcc.gnu.org/home/gccadmin/gcc-reposurgeon-7a.git
>> git+ssh://gcc.gnu.org/home/gccadmin/gcc-reposurgeon-7b.git
>>
>> These have further accumulated improvements, especially identifying more
>> merge commits, many improved author attributions, and many improved commit
>> summaries / PR number fixups from Richard's scripts.  We're working on
>> identifying further cases where author attributions can be safely
>> improved.
> 
> A few comments about my commits (and others).
> * SVN r133438 had the wrong PR # used but it was fixed with SVN r133439.

I've added a fixup

> * SVN r134947, only has one commit associated with it but with new
> testcases too.

This is due to the emails not matching exactly.  It's not really
practical to do case independent comparisons, so I've added a forced
email for the author and fixed the summary manually.

> ** Maybe just use the non testsuite/ChangeLog reference as the subject line.

The scanner for this doesn't look at ChangeLog files, only at the commit
message itself.  Again, it's not really feasible to tease
testsuite/non-testsuite changes out reliably, so I don't try.

> ** Also I noticed the author for that revision is detected as
> pinskia@gcc.gnu.org but that is because I used different cases for the
> emails in the changelog.

Which led to all of the above.


R.

> *** Maybe always using lower case for the email part.
> * SVN r160418, does not detect me as the main author or even detect
> Shujing's email.
> ** the changelog entry below what was added had a minor whitespace change to it
> * SVN r211205, does not detect my email address correctly
> ** I had a typo in the date format (014-06-03 when it should have been
> 2014-06-03)
> 
> I don't care if these minor issues don't get fixed, but I suspect
> fixing them will help fix other issues; I don't know if these have
> been fixed yet.
> 
> Thanks,
> Andrew Pinski
> 
> 
>>
>> --
>> Joseph S. Myers
>> joseph@codesourcery.com


* Re: Test GCC conversion with reposurgeon available
  2020-01-06 23:58           ` Andrew Pinski
@ 2020-01-07  0:30             ` Joseph Myers
  2020-01-07  0:44             ` Richard Earnshaw
  1 sibling, 0 replies; 54+ messages in thread
From: Joseph Myers @ 2020-01-07  0:30 UTC (permalink / raw)
  To: Andrew Pinski; +Cc: GCC Mailing List, Eric Raymond

On Mon, 6 Jan 2020, Andrew Pinski wrote:

> ** Also I noticed the author for that revision is detected as
> pinskia@gcc.gnu.org but that is because I used different cases for the
> emails in the changelog.

In my review of possibly suspect authors I'm concentrating on cases where 
the author name needs review (based on either reposurgeon reporting a 
ChangeLog header it can't parse, or differences in author name from 
Maxim's conversion, as an indication that the choice of name is worthy of 
review), not those where the name is probably OK but it might be possible 
to improve the choice of email address.

> * SVN r160418, does not detect me as the main author or even detect
> Shujing's email.
> ** the changelog entry below what was added had a minor whitespace change to it
> * SVN r211205, does not detect my email address correctly

Both of those are included in the manual fixups I've done for suspect 
cases (since that conversion was generated).  I've reviewed all the (about 
500) cases of reposurgeon reporting ChangeLog headers it can't parse (with 
a wide range of typos in date formats etc.), other than those Richard set 
up fixes for before I started that review, and all the (about 1200) most 
likely to be significant cases of differences in authors (names) from 
Maxim's conversion (cases where the commit isn't ChangeLog-only or marked 
as a backport, resulting in about 400 author improvements, the rest being 
cases where I thought the author from reposurgeon was the most appropriate 
one; ChangeLog-missing cases had been reviewed earlier).

-- 
Joseph S. Myers
joseph@codesourcery.com


* Re: Test GCC conversion with reposurgeon available
  2020-01-03 12:38         ` Joseph Myers
@ 2020-01-06 23:58           ` Andrew Pinski
  2020-01-07  0:30             ` Joseph Myers
  2020-01-07  0:44             ` Richard Earnshaw
  2020-01-09 12:22           ` Joseph Myers
  1 sibling, 2 replies; 54+ messages in thread
From: Andrew Pinski @ 2020-01-06 23:58 UTC (permalink / raw)
  To: Joseph Myers; +Cc: GCC Mailing List, Eric Raymond

On Fri, Jan 3, 2020 at 4:38 AM Joseph Myers <joseph@codesourcery.com> wrote:
>
> On Sat, 28 Dec 2019, Joseph Myers wrote:
>
> > Two more.
> >
> > git+ssh://gcc.gnu.org/home/gccadmin/gcc-reposurgeon-6a.git
> > git+ssh://gcc.gnu.org/home/gccadmin/gcc-reposurgeon-6b.git
>
> Two more.
>
> git+ssh://gcc.gnu.org/home/gccadmin/gcc-reposurgeon-7a.git
> git+ssh://gcc.gnu.org/home/gccadmin/gcc-reposurgeon-7b.git
>
> These have further accumulated improvements, especially identifying more
> merge commits, many improved author attributions, and many improved commit
> summaries / PR number fixups from Richard's scripts.  We're working on
> identifying further cases where author attributions can be safely
> improved.

A few comments about my commits (and others).
* SVN r133438 had the wrong PR # used but it was fixed with SVN r133439.
* SVN r134947, only has one commit associated with it but with new
testcases too.
** Maybe just use the non testsuite/ChangeLog reference as the subject line.
** Also I noticed the author for that revision is detected as
pinskia@gcc.gnu.org but that is because I used different cases for the
emails in the changelog.
*** Maybe always using lower case for the email part.
* SVN r160418, does not detect me as the main author or even detect
Shujing's email.
** the changelog entry below what was added had a minor whitespace change to it
* SVN r211205, does not detect my email address correctly
** I had a typo in the date format (014-06-03 when it should have been
2014-06-03)
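
The lower-casing suggestion above amounts to comparing addresses after normalizing them. A minimal standalone sketch, not reposurgeon code; the function name is made up, and folding the local part is a pragmatic assumption (mail standards allow a case-sensitive local part, though that is rare in practice):

```python
def same_email(a, b):
    """Compare two addresses, ignoring case and any surrounding <> or spaces."""
    def norm(addr):
        return addr.strip().strip("<>").strip().casefold()
    return norm(a) == norm(b)
```

With this, `same_email("<Jane.Doe@Example.org>", "jane.doe@example.org")` is true, so two ChangeLog entries that differ only in the case of the address would be attributed to the same author.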

I don't care if these minor issues don't get fixed, but I suspect
fixing them will help fix other issues; I don't know if these have
been fixed yet.

Thanks,
Andrew Pinski


>
> --
> Joseph S. Myers
> joseph@codesourcery.com


* Re: Test GCC conversion with reposurgeon available
  2019-12-28 16:30       ` Joseph Myers
@ 2020-01-03 12:38         ` Joseph Myers
  2020-01-06 23:58           ` Andrew Pinski
  2020-01-09 12:22           ` Joseph Myers
  0 siblings, 2 replies; 54+ messages in thread
From: Joseph Myers @ 2020-01-03 12:38 UTC (permalink / raw)
  To: gcc; +Cc: esr

On Sat, 28 Dec 2019, Joseph Myers wrote:

> Two more.
> 
> git+ssh://gcc.gnu.org/home/gccadmin/gcc-reposurgeon-6a.git
> git+ssh://gcc.gnu.org/home/gccadmin/gcc-reposurgeon-6b.git

Two more.

git+ssh://gcc.gnu.org/home/gccadmin/gcc-reposurgeon-7a.git
git+ssh://gcc.gnu.org/home/gccadmin/gcc-reposurgeon-7b.git

These have further accumulated improvements, especially identifying more 
merge commits, many improved author attributions, and many improved commit 
summaries / PR number fixups from Richard's scripts.  We're working on 
identifying further cases where author attributions can be safely 
improved.

-- 
Joseph S. Myers
joseph@codesourcery.com


* Re: Test GCC conversion with reposurgeon available
  2019-12-22 13:57     ` Joseph Myers
  2019-12-23 17:27       ` Roman Zhuykov
  2019-12-24 10:57       ` Maxim Kuvyrkov
@ 2019-12-28 16:30       ` Joseph Myers
  2020-01-03 12:38         ` Joseph Myers
  2 siblings, 1 reply; 54+ messages in thread
From: Joseph Myers @ 2019-12-28 16:30 UTC (permalink / raw)
  To: gcc; +Cc: esr

On Sun, 22 Dec 2019, Joseph Myers wrote:

> Two more.
> 
> git+ssh://gcc.gnu.org/home/gccadmin/gcc-reposurgeon-5a.git
> git+ssh://gcc.gnu.org/home/gccadmin/gcc-reposurgeon-5b.git

Two more.

git+ssh://gcc.gnu.org/home/gccadmin/gcc-reposurgeon-6a.git
git+ssh://gcc.gnu.org/home/gccadmin/gcc-reposurgeon-6b.git

These have accumulated mostly minor improvements.  There are no 
automatically-generated .gitignore files any more.  There are no merge 
commits on master any more.  There are various improvements relating to 
ChangeLog handling, though some more such improvements went into 
reposurgeon after this conversion run started and thus aren't included 
(and this conversion doesn't include Richard's list of typo fixes for 
attributions).  There are many more whitelistings / PR fixups for 
generated commit summaries (so all revisions 200000 and later with a 
checkme have had it resolved).

Regarding the ChangeLog improvements, the specific common patterns Jakub 
noted that appear at the start and end of the alphabet are worked around 
in bugdb.py for this conversion and handled properly in reposurgeon in 
changes that went in after this conversion started.  Of the three authors 
remaining that look like they follow those patterns, one should be handled 
by the code now in reposurgeon (just not by the workaround in bugdb.py) 
and the other two are cases where I'll add fixup entries in bugdb.py (they 
do things like repeat the whole date twice).  I'll also add a fix in 
bugdb.py for the cases where the name is inside "" or () in the ChangeLog 
entry as that can be handled fully automatically.
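
As an illustration of why the quoted or parenthesized name case can be handled automatically, here is a sketch of a ChangeLog header parser. It is not the code in bugdb.py or reposurgeon; the regular expression, the function name and the sample names are assumptions made for the example.

```python
import re

# Assumed header shape: "YYYY-MM-DD  Name  <email>".
HEADER_RE = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2})\s+(?P<name>.+?)\s+<(?P<email>[^<>\s]+)>\s*$"
)


def parse_header(line):
    """Return (date, name, email) for a ChangeLog header line, or None."""
    m = HEADER_RE.match(line)
    if m is None:
        return None
    name = m.group("name").strip()
    # Unwrap a name written as "Name" or (Name).
    if len(name) >= 2 and (name[0], name[-1]) in {('"', '"'), ("(", ")")}:
        name = name[1:-1].strip()
    return m.group("date"), name, m.group("email")
```

A header such as `2019-12-28  "Jane Doe"  <jane@example.org>` yields the plain name `Jane Doe`, while a header with a malformed date is returned as `None` and would still need a manual fixup.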

-- 
Joseph S. Myers
jsm@polyomino.org.uk


* Re: Test GCC conversion with reposurgeon available
  2019-12-27 21:30                       ` Andreas Schwab
@ 2019-12-28  2:43                         ` Eric S. Raymond
  0 siblings, 0 replies; 54+ messages in thread
From: Eric S. Raymond @ 2019-12-28  2:43 UTC (permalink / raw)
  To: Andreas Schwab
  Cc: Roman Zhuykov, Segher Boessenkool, Joseph Myers, gcc,
	Alexander Monakov, Maxim Kuvyrkov

Andreas Schwab <schwab@linux-m68k.org>:
> On Dez 25 2019, Eric S. Raymond wrote:
> 
> > That's easily fixed by adding a timezone entry to your author-map
> > entry - CET, is it?
> 
> The time zone is not constant.

Congratulations, you have broken one of reposurgeon's assumptions.

It is possible to use reposurgeon's DSL to set committer TZ on a 
selected set of commits; if you want to work up a patch for the
lift script we'll take it.
-- 
		<a href="http://www.catb.org/~esr/">Eric S. Raymond</a>



* Re: Test GCC conversion with reposurgeon available
  2019-12-27 21:29                           ` Andreas Schwab
@ 2019-12-27 21:43                             ` Joseph Myers
  0 siblings, 0 replies; 54+ messages in thread
From: Joseph Myers @ 2019-12-27 21:43 UTC (permalink / raw)
  To: Andreas Schwab
  Cc: Roman Zhuykov, Segher Boessenkool, gcc, Alexander Monakov,
	Maxim Kuvyrkov, esr

On Fri, 27 Dec 2019, Andreas Schwab wrote:

> SVN also only has a committer, so the fabricated author should not be
> influenced by the committer.

That issue has been fixed.

-- 
Joseph S. Myers
jsm@polyomino.org.uk


* Re: Test GCC conversion with reposurgeon available
  2019-12-25 19:19                     ` Eric S. Raymond
@ 2019-12-27 21:30                       ` Andreas Schwab
  2019-12-28  2:43                         ` Eric S. Raymond
  0 siblings, 1 reply; 54+ messages in thread
From: Andreas Schwab @ 2019-12-27 21:30 UTC (permalink / raw)
  To: Eric S. Raymond
  Cc: Roman Zhuykov, Segher Boessenkool, Joseph Myers, gcc,
	Alexander Monakov, Maxim Kuvyrkov

On Dez 25 2019, Eric S. Raymond wrote:

> That's easily fixed by adding a timezone entry to your author-map
> entry - CET, is it?

The time zone is not constant.

Andreas.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
"And now for something completely different."


* Re: Test GCC conversion with reposurgeon available
  2019-12-25 15:36                         ` Joseph Myers
  2019-12-25 17:15                           ` Segher Boessenkool
  2019-12-25 19:40                           ` Eric S. Raymond
@ 2019-12-27 21:29                           ` Andreas Schwab
  2019-12-27 21:43                             ` Joseph Myers
  2 siblings, 1 reply; 54+ messages in thread
From: Andreas Schwab @ 2019-12-27 21:29 UTC (permalink / raw)
  To: Joseph Myers
  Cc: Roman Zhuykov, Segher Boessenkool, gcc, Alexander Monakov,
	Maxim Kuvyrkov, esr

On Dez 25 2019, Joseph Myers wrote:

> On investigation, I think you are referring to the conversion of r269472.  
> That was committed for you by Jim Wilson and thus has you as author and 
> Jim Wilson as committer and Jim Wilson's timezone entry has been applied.  
> So the argument here is that the author's timezone information should be 
> applied to the author date, and the committer's timezone information 
> should be applied to the committer date.  I expect that should be 
> straightforward (although when coming from SVN, there's also an argument 
> that we only have committer dates so the committer timezone is the 
> relevant one to apply).

SVN also only has a committer, so the fabricated author should not be
influenced by the committer.

Andreas.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
"And now for something completely different."


* Re: Test GCC conversion with reposurgeon available
  2019-12-26 22:32                                     ` Eric S. Raymond
@ 2019-12-27 14:40                                       ` Segher Boessenkool
  0 siblings, 0 replies; 54+ messages in thread
From: Segher Boessenkool @ 2019-12-27 14:40 UTC (permalink / raw)
  To: Eric S. Raymond; +Cc: Toon Moene, gcc

On Thu, Dec 26, 2019 at 05:32:52PM -0500, Eric S. Raymond wrote:
> Toon Moene <toon@moene.org>:
> > So we are going to base this world wide free software endeavor on a source
> > code system that doesn't keep time by UTC ?
> 
> They all *do* keep time by UTC.

(Git stores unix time, instead -- close enough ;-) )

> What confuses me is why they ever try to *display* anything other than UTC.

That depends on what you yourself configured (log.date, blame.date).

> It seems pointless to me to ever display local time in clients, but they do it
> anyway. 

Many people only work with their local team in their local office.

For GCC we do not care, and we could always store +0000 as timezone
offset -- we do not *have* the correct data, and at least UTC is more
useful to display.
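
To make the point concrete: git's raw date format is a Unix timestamp followed by a UTC offset, and the offset only changes how the instant is displayed. The sketch below uses only the Python standard library; the helper name and the sample offsets are illustrative.

```python
from datetime import datetime, timedelta, timezone


def show(unix_seconds, offset_minutes):
    """Render an instant as it would display with the given UTC offset."""
    tz = timezone(timedelta(minutes=offset_minutes))
    return datetime.fromtimestamp(unix_seconds, tz).strftime(
        "%Y-%m-%d %H:%M:%S %z"
    )


instant = 1577399572  # 2019-12-26 22:32:52 UTC

print(show(instant, 0))     # 2019-12-26 22:32:52 +0000
print(show(instant, -300))  # 2019-12-26 17:32:52 -0500
print(show(instant, 60))    # 2019-12-26 23:32:52 +0100

# Ordering depends on the stored seconds alone, never on the offset.
commits = [(instant + 60, -300), (instant, 60)]
ordered = sorted(commits, key=lambda c: c[0])
```

Storing +0000 everywhere, as suggested above, would change none of the stored instants; it would only make every commit display in UTC.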


Segher


* Re: Test GCC conversion with reposurgeon available
  2019-12-25 11:03                 ` Roman Zhuykov
  2019-12-25 11:20                   ` Joseph Myers
  2019-12-25 14:32                   ` Andreas Schwab
@ 2019-12-27 14:37                   ` Richard Earnshaw
  2 siblings, 0 replies; 54+ messages in thread
From: Richard Earnshaw @ 2019-12-27 14:37 UTC (permalink / raw)
  To: Roman Zhuykov, Segher Boessenkool, Joseph Myers
  Cc: gcc, Alexander Monakov, Maxim Kuvyrkov, esr

On 25/12/2019 11:02, Roman Zhuykov wrote:
> First of all thanks to everyone who spent time making the conversion
> better and better. Here is my 2c, I have studied a little my colleagues
> trunk history in Maxim's gcc-pretty vs gcc-reposurgeon-5b.
> 
> 1) In gcc-pretty timezone info is lost in both author/committer date
> (keeping UTC time correct, certainly). Examples are r278990 and r289989.
> Probably git-svn causes this, current read-only git mirror is also
> without timezone. Not sure we need that info, but reposurgeon is more
> correct here.
> 
> 2) Some thoughts about script for summarizing commit log messages:
> 2a) Why do r143753 and r150680 not have "re PR..." summary instead of
> "[multiple changes]" ?

Both of these commits have more than one hunk with different authors,
that triggers the heuristic to detect multiple independent changes that
have been merged into a single commit (this is most common on branches,
where it is common to aggregate a large number of backport commits - for
those, using the first PR found as a key is rarely right even if there
is only one PR mentioned).  As Joseph said, providing a specific rule
for these cases is possible.

> 2b) On the contrary r155892 have to mention two PRs, even "[multiple
> changes]" is better here, IMHO.

For this one, the heuristic is that the PRs are likely to be related and
that either *could* form a summary.  We choose the first one mentioned.
 Looking at the commit log I cannot tell whether this is two independent
fixes rolled into a single commit or two bugs that relate to the same
change.
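
The two heuristics described so far (several differently-authored hunks mean "[multiple changes]", otherwise the first PR mentioned wins) can be sketched as follows. This is an illustration only, not the actual summary script; the function name, the PR regular expression and the simplified "re PR ..." output are assumptions.

```python
import re

# Matches references such as "PR c++/12345" or "PR middle-end/678".
PR_RE = re.compile(r"\bPR\s+([a-z0-9+-]+/\d+)")


def summarize(log_message, hunk_authors):
    """Pick a one-line summary for a commit log.

    hunk_authors: the author of each ChangeLog hunk found in the message.
    """
    # Different authors suggest independent changes merged into one commit.
    if len(set(hunk_authors)) > 1:
        return "[multiple changes]"
    # Otherwise assume the PRs are related and key on the first one.
    match = PR_RE.search(log_message)
    if match:
        return "re PR " + match.group(1)
    # Fall back to the first non-blank line of the log.
    return next(
        (ln.strip() for ln in log_message.splitlines() if ln.strip()), ""
    )
```

A log that mentions two PRs under a single author gets the first PR as its key, which mirrors the r155892 outcome described above.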

> 2c) In r130050 and r155902 we have "Rename too ... " in summary, not
> sure how to make it better.

Well, we *could* try to extract the target function name for the rename
of the target, but I'm not about to try that right now.

> 2d) r146882 can have better summary if we somehow organize ChangeLog
> priority (gcc/ChangeLog is more important than the testsuite one).

Commit logs follow conventions quite weakly (if they followed them more
strongly we wouldn't need the script at all, since all of them would
have a summary line already ;-)  As such, parsing them is not easy.  The
real question you need to answer is along the lines of: is

	20071210-2.c: New testcase.

a *less* useful summary than

	gcc/testsuite/Changelog:

(which is what will appear if we do nothing in this case)?

Unless the answer to that is 'no' then my script is working as intended,
which is to try to produce something more useful than we would get by
default.

If we had another year, it might be possible to develop some AI that
read the commit log, searched the mailing list for the relevant email for
the commit and then proceeded to solve the halting problem as something
trivial along the side, but we aren't going to wait that long ;)

More importantly, if you see commits with particularly egregious
summaries in the trial conversions, please file a ticket at
https://gitlab.com/esr/gcc-conversion.git and we can look into what
suitable action might be needed.

R.

> 
> 3) About author emails, see below
> 24.12.2019 21:14, Segher Boessenkool wrote:
>> On Tue, Dec 24, 2019 at 05:16:54PM +0000, Joseph Myers wrote:
>>> On Tue, 24 Dec 2019, Segher Boessenkool wrote:
>>>>> That's because that commit also edits ChangeLog entries from other
>>>>> authors.  When a commit adds / edits ChangeLog entries for more
>>>>> than one
>>>>> author (the difference between purely editing an existing entry and
>>>>> adding
>>>>> a new one, possibly under an existing date/author header, for a
>>>>> multi-author commit, is not something that can reliably be determined
>>>>> automatically), the conversion falls back to using the committer
>>>>> identity
>>>>> instead of picking one of the multiple relevant authors from the
>>>>> ChangeLog
>>>>> files.
>>>> There is only one relevant author in r270511.  It edits a few wrong
>>>> path
>>>> names in the previous changelog entries.  People often do similar
>>>> things
>>>> (like fixing the commit date :-) )
>>> Distinguishing "edits a previous ChangeLog entry" from "adds a new entry
>>> under a previous ChangeLog header for a change included in the
>>> commit" is
>>> a human judgement.
>> We are doing only one conversion here, the one of the GCC repo.  The
>> heuristic works, we checked it did.
>>
>>>> Either never use <account>@gcc.gnu.org, or always use it, don't do the
>>>> worst of both worlds?
>>> The heuristics here are to use an attribution from ChangeLog for the
>>> author where unambiguous, but to use the committer (always
>>> @gcc.gnu.org /
>>> @gnu.org [*], so avoiding attributions at the wrong company even where
>>> people were using multiple addresses simultaneously for different
>>> changes)
>>> as author if in doubt.
>> You never need that, and it is worse to use two different schemes than to
>> choose either.
>>
>> I would have chosen the "<account>@gcc.gnu.org" scheme, because it is
>> simple and *correct*.  Other people wanted the nicer names.  Maxim's
>> conversion gets that correct.  Please copy it.
>>
> IMHO Segher is a bit categorical in the discussion, but I'll be glad to
> see brief description of Maxim's approach to manage emails, gcc-pretty
> shows better results.
> Speaking about the script counting authors from ChangeLog files, even if
> we drop an "edits a previous ChangeLog entry" issue, it still sometimes
> does not work as Joseph described:
> 3a) In r155892, r155893 and r259314 Alex is not counted as the only
> author without any reason.
> 3b) In r139854, r141108 and r196252 script selected the author
> successfully, while actually there are more than one.
> 3c) Maybe here we can also somehow organize ChangeLog priority (again,
> gcc/ChangeLog is more important than the testsuite one). There are a lot of
> examples where the testsuite/ChangeLog entry has another author: r145055,
> r150680, r155889, r155894, r155890, r163904, r180186 and r183325.
> 3d) If we fix 3b+3c we can also look at r143753, r155890 and r155895.
> 3e) r155891, r207422, r183627 and r234218 are examples of commits which
> don't touch any ChangeLog files for different reasons. Seems unsolvable
> in current approach.
> 
> -- 
> Roman


* Re: Test GCC conversion with reposurgeon available
  2019-12-26 22:57                                   ` Vincent Lefevre
@ 2019-12-26 23:38                                     ` Eric S. Raymond
  0 siblings, 0 replies; 54+ messages in thread
From: Eric S. Raymond @ 2019-12-26 23:38 UTC (permalink / raw)
  To: gcc

Vincent Lefevre <vincent+gcc@vinc17.org>:
> What matters is that the date is correct. I don't think the timezone
> matters (that's why SVN doesn't store timezone information, I assume),
> possibly except for the committer himself (?). For instance,

Subversion doesn't store timezone because all commits are considered
to have occurred at UTC time on a central repository.

I think time as well as date matters because sometimes the order of
commits could be significant information even if they were made on
the same day.
-- 
		<a href="http://www.catb.org/~esr/">Eric S. Raymond</a>


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: Test GCC conversion with reposurgeon available
  2019-12-26 21:31                                 ` Eric S. Raymond
  2019-12-26 22:25                                   ` Toon Moene
@ 2019-12-26 22:57                                   ` Vincent Lefevre
  2019-12-26 23:38                                     ` Eric S. Raymond
  1 sibling, 1 reply; 54+ messages in thread
From: Vincent Lefevre @ 2019-12-26 22:57 UTC (permalink / raw)
  To: gcc

On 2019-12-26 16:30:15 -0500, Eric S. Raymond wrote:
> Vincent Lefevre <vincent+gcc@vinc17.org>:
> > > Here's why you want to get timezones right: there are going to be times
> > > when the order of commits is significant information for a developer's
> > > understanding of what happened.  But without a timezone you only know 
> > > the actual time of a commit to 24-hour resolution.
> > 
> > I don't understand what you mean. What matters for the order of
> > commits is the global time, and this is what SVN stores. SVN does not
> > store timezone information, i.e. it has no idea what local time
> > the user had, but I don't think this is important information.
> 
> UTC time plus a timezone offset is what git stores.  That's not the
> locus of the problem.
> 
> In Subversion-land there's never any doubt about the sequence of commits;
> the revision numbers tell you that.  In Git-land you have to go by timestamps,
> and if a timezone entry is wrong it can skew the displayed time.

What matters is that the date is correct. I don't think the timezone
matters (that's why SVN doesn't store timezone information, I assume),
possibly except for the committer himself (?). For instance,

  2019-11-27 02:32:02 +0100

and

  2019-11-27 01:32:02 +0000

correspond to the same date. So, each one (as stored in the repository)
is fine if you want to be able to know the order of commits.
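
As a quick check of the point above, here is a minimal Python sketch
(standard library only; the variable names are illustrative). It parses
both timestamps and compares them as instants:

```python
from datetime import datetime, timezone

# The two timestamps quoted above, written as "date time offset".
FMT = "%Y-%m-%d %H:%M:%S %z"
local = datetime.strptime("2019-11-27 02:32:02 +0100", FMT)
utc = datetime.strptime("2019-11-27 01:32:02 +0000", FMT)

# Timezone-aware datetimes compare by the instant they denote,
# not by the offset they happen to be written with.
same_instant = (local == utc)

# Normalising to UTC yields the same wall-clock value for both.
local_as_utc = local.astimezone(timezone.utc)
print(same_instant, local_as_utc.isoformat())
```

Sorting commits on such values therefore gives the same order whatever
offset each timestamp carries.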

What is displayed then is actually a user-config issue. The conversion
utility can't solve this issue, since after conversion, committers will
be able to use any timezone they like.

> Me, I don't understand why version-control systems designed for distributed
> use don't ignore timezones entirely and display all times in UTC - relative
> time is surely more important than the commit time's relationship to solar
> noon wherever the keyboard happened to be. But I don't make these decisions.

I agree, at least being able to display all times in a *fixed* timezone
(chosen by the user), as this could be easier for the user to know when
recent commits occur (by "recent", this can be less than 24 hours ago).

For UTC, you can use:

  TZ=UTC git log --date=iso-local

The date format can be stored in ~/.gitconfig, but unfortunately
not local timezone information.

In this case, the timezones of the commits chosen by the conversion
utility will not matter at all.

-- 
Vincent Lefèvre <vincent@vinc17.net> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: Test GCC conversion with reposurgeon available
  2019-12-26 22:25                                   ` Toon Moene
@ 2019-12-26 22:32                                     ` Eric S. Raymond
  2019-12-27 14:40                                       ` Segher Boessenkool
  0 siblings, 1 reply; 54+ messages in thread
From: Eric S. Raymond @ 2019-12-26 22:32 UTC (permalink / raw)
  To: Toon Moene; +Cc: gcc

Toon Moene <toon@moene.org>:
> On 12/26/19 10:30 PM, Eric S. Raymond wrote:
> 
> > Me, I don't understand why version-control systems designed for distributed
> > use don't ignore timezones entirely and display all times in UTC - relative
> > time is surely more important than the commit time's relationship to solar
> > noon wherever the keyboard happened to be. But I don't make these decisions.
> 
> So we are going to base this world wide free software endeavor on a source
> code system that doesn't keep time by UTC ?

They all *do* keep time by UTC.

What confuses me is why they ever try to *display* anything other than UTC.
It seems pointless to me to ever display local time in clients, but they do it
anyway.

Without that complication, there would be no need to track user timezones.
-- 
		<a href="http://www.catb.org/~esr/">Eric S. Raymond</a>


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: Test GCC conversion with reposurgeon available
  2019-12-26 21:31                                 ` Eric S. Raymond
@ 2019-12-26 22:25                                   ` Toon Moene
  2019-12-26 22:32                                     ` Eric S. Raymond
  2019-12-26 22:57                                   ` Vincent Lefevre
  1 sibling, 1 reply; 54+ messages in thread
From: Toon Moene @ 2019-12-26 22:25 UTC (permalink / raw)
  To: esr, gcc

On 12/26/19 10:30 PM, Eric S. Raymond wrote:

> Me, I don't understand why version-control systems designed for distributed
> use don't ignore timezones entirely and display all times in UTC - relative
> time is surely more important than the commit time's relationship to solar
> noon wherever the keyboard happened to be. But I don't make these decisions.

So we are going to base this world wide free software endeavor on a 
source code system that doesn't keep time by UTC ?

My God - imagine if weather forecasting was done this way.

-- 
Toon Moene - e-mail: toon@moene.org - phone: +31 346 214290
Saturnushof 14, 3738 XG  Maartensdijk, The Netherlands
At home: http://moene.org/~toon/; weather: http://moene.org/~hirlam/
Progress of GNU Fortran: http://gcc.gnu.org/wiki/GFortran#news

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: Test GCC conversion with reposurgeon available
  2019-12-26 21:03                               ` Vincent Lefevre
@ 2019-12-26 21:31                                 ` Eric S. Raymond
  2019-12-26 22:25                                   ` Toon Moene
  2019-12-26 22:57                                   ` Vincent Lefevre
  0 siblings, 2 replies; 54+ messages in thread
From: Eric S. Raymond @ 2019-12-26 21:31 UTC (permalink / raw)
  To: gcc

Vincent Lefevre <vincent+gcc@vinc17.org>:
> > Here's why you want to get timezones right: there are going to be times
> > when the order of commits is significant information for a developer's
> > understanding of what happened.  But without a timezone you only know 
> > the actual time of a commit to 24-hour resolution.
> 
> I don't understand what you mean. What matters for the order of
> commits is the global time, and this is what SVN stores. SVN does not
> store timezone information, i.e. it has no idea what local time
> the user had, but I don't think this is important information.

UTC time plus a timezone offset is what git stores.  That's not the
locus of the problem.

In Subversion-land there's never any doubt about the sequence of commits;
the revision numbers tell you that.  In Git-land you have to go by timestamps,
and if a timezone entry is wrong it can skew the displayed time.

Me, I don't understand why version-control systems designed for distributed
use don't ignore timezones entirely and display all times in UTC - relative
time is surely more important than the commit time's relationship to solar
noon wherever the keyboard happened to be. But I don't make these decisions.
-- 
		<a href="http://www.catb.org/~esr/">Eric S. Raymond</a>


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: Test GCC conversion with reposurgeon available
  2019-12-25 19:33                             ` Eric S. Raymond
@ 2019-12-26 21:03                               ` Vincent Lefevre
  2019-12-26 21:31                                 ` Eric S. Raymond
  0 siblings, 1 reply; 54+ messages in thread
From: Vincent Lefevre @ 2019-12-26 21:03 UTC (permalink / raw)
  To: gcc

On 2019-12-25 14:33:45 -0500, Eric S. Raymond wrote:
> Segher Boessenkool <segher@kernel.crashing.org>:
> > The goal is not to pretend we never used SVN.
> 
> One of *my* goals is that the illusion of git back to the beginning of
> time should be as consistent as possible.
> 
> > The goal is to have a Git repo that is as useful as possible for us.
> 
> Exactly.  I've already written about minimizing cognitive friction.
> 
> Here's why you want to get timezones right: there are going to be times
> when the order of commits is significant information for a developer's
> understanding of what happened.  But without a timezone you only know 
> the actual time of a commit to 24-hour resolution.

I don't understand what you mean. What matters for the order of
commits is the global time, and this is what SVN stores. SVN does not
store timezone information, i.e. it has no idea what local time
the user had, but I don't think this is important information.

-- 
Vincent Lefèvre <vincent@vinc17.net> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: Test GCC conversion with reposurgeon available
  2019-12-25 15:36                         ` Joseph Myers
  2019-12-25 17:15                           ` Segher Boessenkool
@ 2019-12-25 19:40                           ` Eric S. Raymond
  2019-12-27 21:29                           ` Andreas Schwab
  2 siblings, 0 replies; 54+ messages in thread
From: Eric S. Raymond @ 2019-12-25 19:40 UTC (permalink / raw)
  To: Joseph Myers
  Cc: Andreas Schwab, Roman Zhuykov, Segher Boessenkool, gcc,
	Alexander Monakov, Maxim Kuvyrkov

Joseph Myers <jsm@polyomino.org.uk>:
> On Wed, 25 Dec 2019, Andreas Schwab wrote:
> 
> > On Dez 25 2019, Joseph Myers wrote:
> > 
> > > Timezones for any email address can be specified in gcc.map for any 
> > > authors wishing to have an appropriate timezone used for their commits.
> > 
> > But that should not be used for unrelated authors.
> 
> It's not.
> 
> On investigation, I think you are referring to the conversion of r269472.  
> That was committed for you by Jim Wilson and thus has you as author and 
> Jim Wilson as committer and Jim Wilson's timezone entry has been applied.  
> So the argument here is that the author's timezone information should be 
> applied to the author date, and the committer's timezone information 
> should be applied to the committer date.  I expect that should be 
> straightforward (although when coming from SVN, there's also an argument 
> that we only have committer dates so the committer timezone is the 
> relevant one to apply).

There's also an FSF policy about Changelogs that's relevant, I think.

Git sometimes fills in the author field from the committer, and
Changelog parsing is done only after translation. That's probably the
source of this bug.

If anybody cares enough to file a bug with a test load attached, I
can probably fix this.
-- 
		<a href="http://www.catb.org/~esr/">Eric S. Raymond</a>


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: Test GCC conversion with reposurgeon available
  2019-12-25 17:15                           ` Segher Boessenkool
@ 2019-12-25 19:33                             ` Eric S. Raymond
  2019-12-26 21:03                               ` Vincent Lefevre
  0 siblings, 1 reply; 54+ messages in thread
From: Eric S. Raymond @ 2019-12-25 19:33 UTC (permalink / raw)
  To: Segher Boessenkool
  Cc: Joseph Myers, Andreas Schwab, Roman Zhuykov, gcc,
	Alexander Monakov, Maxim Kuvyrkov

Segher Boessenkool <segher@kernel.crashing.org>:
> The goal is not to pretend we never used SVN.

One of *my* goals is that the illusion of git back to the beginning of
time should be as consistent as possible.

> The goal is to have a Git repo that is as useful as possible for us.

Exactly.  I've already written about minimizing cognitive friction.

Here's why you want to get timezones right: there are going to be times
when the order of commits is significant information for a developer's
understanding of what happened.  But without a timezone you only know 
the actual time of a commit to 24-hour resolution.

There is no way we'll get this perfect.  But there is more wrong and
less wrong, and reposurgeon tries hard to be less wrong.
-- 
		<a href="http://www.catb.org/~esr/">Eric S. Raymond</a>


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: Test GCC conversion with reposurgeon available
  2019-12-25 14:32                   ` Andreas Schwab
  2019-12-25 14:41                     ` Joseph Myers
@ 2019-12-25 19:19                     ` Eric S. Raymond
  2019-12-27 21:30                       ` Andreas Schwab
  1 sibling, 1 reply; 54+ messages in thread
From: Eric S. Raymond @ 2019-12-25 19:19 UTC (permalink / raw)
  To: Andreas Schwab
  Cc: Roman Zhuykov, Segher Boessenkool, Joseph Myers, gcc,
	Alexander Monakov, Maxim Kuvyrkov

Andreas Schwab <schwab@linux-m68k.org>:
> Definitely not.  I have never authored or committed any revision in the
> -0800 time zone.

That's easily fixed by adding a timezone entry to your author-map
entry - CET, is it?  That will prevent reposurgeon from making any
attempt to deduce your timezone.

It would be interesting to know how reposurgeon got misled.  Most
likely it was by a Changelog entry.  Reposurgeon watches as these are
being processed to see if it can pin an email address to a single timezone
by looking up its TLD in the IANA database.

I don't know how that could land you in California, though. Maybe
I ought to be logging timezone deductions so we can trace them back.

Has anyone else seen wrong timezone attributions?
-- 
		<a href="http://www.catb.org/~esr/">Eric S. Raymond</a>


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: Test GCC conversion with reposurgeon available
  2019-12-25 15:36                         ` Joseph Myers
@ 2019-12-25 17:15                           ` Segher Boessenkool
  2019-12-25 19:33                             ` Eric S. Raymond
  2019-12-25 19:40                           ` Eric S. Raymond
  2019-12-27 21:29                           ` Andreas Schwab
  2 siblings, 1 reply; 54+ messages in thread
From: Segher Boessenkool @ 2019-12-25 17:15 UTC (permalink / raw)
  To: Joseph Myers
  Cc: Andreas Schwab, Roman Zhuykov, gcc, Alexander Monakov,
	Maxim Kuvyrkov, esr

On Wed, Dec 25, 2019 at 03:36:38PM +0000, Joseph Myers wrote:
> On Wed, 25 Dec 2019, Andreas Schwab wrote:
> 
> > On Dez 25 2019, Joseph Myers wrote:
> > 
> > > Timezones for any email address can be specified in gcc.map for any 
> > > authors wishing to have an appropriate timezone used for their commits.
> > 
> > But that should not be used for unrelated authors.
> 
> It's not.
> 
> On investigation, I think you are referring to the conversion of r269472.  
> That was committed for you by Jim Wilson and thus has you as author and 
> Jim Wilson as committer and Jim Wilson's timezone entry has been applied.  
> So the argument here is that the author's timezone information should be 
> applied to the author date, and the committer's timezone information 
> should be applied to the committer date.  I expect that should be 
> straightforward (although when coming from SVN, there's also an argument 
> that we only have committer dates so the committer timezone is the 
> relevant one to apply).

Or we could just not make up any time zone at all.  The information
isn't there, what is gained by faking something?

Having people's real names is obviously useful.  Showing the email
address they used when they did the patch (which can be an indication of
affiliation, for example) can also be useful, but less so, and is harder
to get right.  But the timezone some patch was made in (or committed in)?

The goal is not to pretend we never used SVN.  The goal is to have a Git
repo that is as useful as possible for us.  For me, that means the stuff
inherited from the older repos should be just that: exactly what was
there before.  With annoyances like real name fixed, perhaps, and maybe
actual errors fixed (although I never in practice saw *any* error that
made anything even the slightest bit harder to do).

But no lipstick.


Segher

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: Test GCC conversion with reposurgeon available
  2019-12-25 15:10                       ` Andreas Schwab
@ 2019-12-25 15:36                         ` Joseph Myers
  2019-12-25 17:15                           ` Segher Boessenkool
                                             ` (2 more replies)
  0 siblings, 3 replies; 54+ messages in thread
From: Joseph Myers @ 2019-12-25 15:36 UTC (permalink / raw)
  To: Andreas Schwab
  Cc: Roman Zhuykov, Segher Boessenkool, gcc, Alexander Monakov,
	Maxim Kuvyrkov, esr

On Wed, 25 Dec 2019, Andreas Schwab wrote:

> On Dez 25 2019, Joseph Myers wrote:
> 
> > Timezones for any email address can be specified in gcc.map for any 
> > authors wishing to have an appropriate timezone used for their commits.
> 
> But that should not be used for unrelated authors.

It's not.

On investigation, I think you are referring to the conversion of r269472.  
That was committed for you by Jim Wilson and thus has you as author and 
Jim Wilson as committer and Jim Wilson's timezone entry has been applied.  
So the argument here is that the author's timezone information should be 
applied to the author date, and the committer's timezone information 
should be applied to the committer date.  I expect that should be 
straightforward (although when coming from SVN, there's also an argument 
that we only have committer dates so the committer timezone is the 
relevant one to apply).
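
A minimal Python sketch of that argument follows. The addresses, offsets
and timestamp are hypothetical placeholders, not values from gcc.map or
r269472, and this is not a description of how reposurgeon does it:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical per-identity UTC offsets (placeholder addresses and values).
TZ_MAP = {
    "author@example.org": timedelta(hours=1),
    "committer@example.org": timedelta(hours=-8),
}

def localize(utc_stamp, email):
    """Render a UTC instant with the offset mapped to email, or UTC if unmapped."""
    offset = TZ_MAP.get(email, timedelta(0))
    return utc_stamp.astimezone(timezone(offset))

# SVN records one UTC timestamp per revision (placeholder value here).
svn_date = datetime(2019, 3, 7, 20, 0, 0, tzinfo=timezone.utc)

# The author date gets the author's offset, the committer date the committer's.
author_date = localize(svn_date, "author@example.org")
committer_date = localize(svn_date, "committer@example.org")
print(author_date.isoformat())
print(committer_date.isoformat())
```

Both results denote the same instant; only the displayed offset differs,
so commit ordering is unaffected by which offset is chosen.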

-- 
Joseph S. Myers
jsm@polyomino.org.uk

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: Test GCC conversion with reposurgeon available
  2019-12-25 14:41                     ` Joseph Myers
@ 2019-12-25 15:10                       ` Andreas Schwab
  2019-12-25 15:36                         ` Joseph Myers
  0 siblings, 1 reply; 54+ messages in thread
From: Andreas Schwab @ 2019-12-25 15:10 UTC (permalink / raw)
  To: Joseph Myers
  Cc: Roman Zhuykov, Segher Boessenkool, gcc, Alexander Monakov,
	Maxim Kuvyrkov, esr

On Dez 25 2019, Joseph Myers wrote:

> Timezones for any email address can be specified in gcc.map for any 
> authors wishing to have an appropriate timezone used for their commits.

But that should not be used for unrelated authors.

Andreas.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: Test GCC conversion with reposurgeon available
  2019-12-25 14:32                   ` Andreas Schwab
@ 2019-12-25 14:41                     ` Joseph Myers
  2019-12-25 15:10                       ` Andreas Schwab
  2019-12-25 19:19                     ` Eric S. Raymond
  1 sibling, 1 reply; 54+ messages in thread
From: Joseph Myers @ 2019-12-25 14:41 UTC (permalink / raw)
  To: Andreas Schwab
  Cc: Roman Zhuykov, Segher Boessenkool, gcc, Alexander Monakov,
	Maxim Kuvyrkov, esr

On Wed, 25 Dec 2019, Andreas Schwab wrote:

> > Not sure we need that info, but reposurgeon is more correct here.
> 
> Definitely not.  I have never authored or committed any revision in the
> -0800 time zone.

If reposurgeon is defaulting to the local time where the conversion is 
run, there's a strong argument it should default to UTC to be 
deterministic.  I've made a note to make the gcc-conversion machinery use 
TZ=UTC0 to avoid such issues.

Timezones for any email address can be specified in gcc.map for any 
authors wishing to have an appropriate timezone used for their commits.

-- 
Joseph S. Myers
jsm@polyomino.org.uk

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: Test GCC conversion with reposurgeon available
  2019-12-25 11:03                 ` Roman Zhuykov
  2019-12-25 11:20                   ` Joseph Myers
@ 2019-12-25 14:32                   ` Andreas Schwab
  2019-12-25 14:41                     ` Joseph Myers
  2019-12-25 19:19                     ` Eric S. Raymond
  2019-12-27 14:37                   ` Richard Earnshaw
  2 siblings, 2 replies; 54+ messages in thread
From: Andreas Schwab @ 2019-12-25 14:32 UTC (permalink / raw)
  To: Roman Zhuykov
  Cc: Segher Boessenkool, Joseph Myers, gcc, Alexander Monakov,
	Maxim Kuvyrkov, esr

On Dez 25 2019, Roman Zhuykov wrote:

> 1) In gcc-pretty timezone info is lost in both author/committer date
> (keeping UTC time correct, certainly).

Since svn doesn't record time zones you cannot lose them, only fabricate
them.

> Not sure we need that info, but reposurgeon is more correct here.

Definitely not.  I have never authored or committed any revision in the
-0800 time zone.

Andreas.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: Test GCC conversion with reposurgeon available
  2019-12-25 11:20                   ` Joseph Myers
@ 2019-12-25 12:23                     ` Eric S. Raymond
  0 siblings, 0 replies; 54+ messages in thread
From: Eric S. Raymond @ 2019-12-25 12:23 UTC (permalink / raw)
  To: Joseph Myers
  Cc: Roman Zhuykov, Segher Boessenkool, gcc, Alexander Monakov,
	Maxim Kuvyrkov, Richard.Earnshaw, rearnsha

Joseph Myers <jsm@polyomino.org.uk>:
> These are all cases covered by the request-for-enhancement issue for 
> adding Co-Authored-by: when the ChangeLog header names multiple authors, 
> as the corresponding de facto git idiom for that case.

I apologize, but I am growing doubtful I can deliver that.  Even if I
can, it may take longer than your conversion schedule allows given
that we've only got five days on the clock.  Here are the problems:

1. I don't have a reduced test case to validate parsing against.

2. The ChangeLog-parsing code is fragile and difficult to modify.
   This is inherent - the syntactic cues it's working with are weak
   and false matches are all too easy.
   
I've got to have 1 before I can even try to deal with 2.
-- 
		<a href="http://www.catb.org/~esr/">Eric S. Raymond</a>


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: Test GCC conversion with reposurgeon available
  2019-12-25 11:03                 ` Roman Zhuykov
@ 2019-12-25 11:20                   ` Joseph Myers
  2019-12-25 12:23                     ` Eric S. Raymond
  2019-12-25 14:32                   ` Andreas Schwab
  2019-12-27 14:37                   ` Richard Earnshaw
  2 siblings, 1 reply; 54+ messages in thread
From: Joseph Myers @ 2019-12-25 11:20 UTC (permalink / raw)
  To: Roman Zhuykov
  Cc: Segher Boessenkool, gcc, Alexander Monakov, Maxim Kuvyrkov, esr,
	Richard.Earnshaw, rearnsha

On Wed, 25 Dec 2019, Roman Zhuykov wrote:

> 2) Some thoughts about script for summarizing commit log messages:
> 2a) Why do r143753 and r150680 not have a "re PR..." summary instead of
> "[multiple changes]" ?
> 2b) On the contrary r155892 has to mention two PRs, even "[multiple changes]"
> is better here, IMHO.
> 2c) In r130050 and r155902 we have "Rename too ... " in summary, not sure how
> to make it better.
> 2d) r146882 can have a better summary if we somehow organize ChangeLog
> priority (gcc/ChangeLog is more important than the testsuite one).

Richard is best placed to comment on these.  His script can provide a 
complete new summary line if the automatically-generated one seems bad.

> 3a) In r155892, r155893 and r259314 Alex is not counted as the only author
> without any reason.

The first two look like cases where the only difference is in the number 
of spaces between name and email in the attributions in different 
ChangeLog files.  Should be straightforward to fix by doing more parsing / 
normalization before deciding whether attributions are the same.
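
One possible shape for that normalization is sketched below in Python.
The function name and the sample attributions are hypothetical, and the
conversion scripts themselves may well do this differently:

```python
import re

def normalize_attribution(line):
    """Split 'Name <email>' and collapse whitespace runs in the name."""
    match = re.match(r"\s*(.*?)\s*<\s*([^<>\s]+)\s*>\s*$", line)
    if match is None:
        # No address found: compare on the whitespace-collapsed text only.
        return (" ".join(line.split()), "")
    name, email = match.groups()
    return (" ".join(name.split()), email)

# Placeholder attributions that differ only in spacing.
a = normalize_attribution("Jane Doe  <jane@example.org>")
b = normalize_attribution("Jane  Doe <jane@example.org>")
print(a == b)
```

Comparing the normalized tuples rather than the raw header text lets two
ChangeLog files with different spacing count as the same author.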

The third is a case where the heuristic is applied that if a commit only 
changes ChangeLog files and nothing else, attributions should not be 
extracted from those ChangeLog files because it's particularly likely in 
that case that someone else's ChangeLog entries are being edited.
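
The attribution heuristics described in this thread can be sketched
roughly as follows, in Python. All names are hypothetical, the
ChangeLog file-name test is an assumption, and the real conversion
scripts may differ in detail:

```python
import posixpath

def is_changelog(path):
    """Assumption: any file whose base name starts with 'ChangeLog' counts."""
    return posixpath.basename(path).startswith("ChangeLog")

def choose_author(changed_paths, changelog_authors, committer):
    """Pick the author identity for one commit.

    changelog_authors holds the (already normalized) attributions found in
    the ChangeLog hunks of the commit.
    """
    # A commit touching only ChangeLog files is likely editing someone
    # else's entries, so no attribution is extracted from it.
    if all(is_changelog(p) for p in changed_paths):
        return committer
    distinct = set(changelog_authors)
    # Unambiguous case: exactly one attribution across all ChangeLogs.
    if len(distinct) == 1:
        return next(iter(distinct))
    # No attribution, or more than one: fall back to the committer.
    return committer
```

For example, a commit that changes gcc/foo.c and gcc/ChangeLog with a
single attribution is credited to that author, while a commit changing
only gcc/ChangeLog is credited to the committer.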

> 3b) In r139854, r141108 and r196252 the script selected the author
> successfully, while actually there is more than one.

These are all cases covered by the request-for-enhancement issue for 
adding Co-Authored-by: when the ChangeLog header names multiple authors, 
as the corresponding de facto git idiom for that case.
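
As an illustration of that idiom, here is a Python sketch that keeps the
first listed author as the commit author and records the rest as trailers.
The function name and sample identities are hypothetical, the trailer key
is written in its commonly used spelling "Co-authored-by", and the sketch
assumes the message has no existing trailer block:

```python
def add_coauthors(message, authors):
    """Return (primary_author, message) with trailers for the other authors."""
    primary, others = authors[0], authors[1:]
    if not others:
        return primary, message
    trailers = "\n".join(f"Co-authored-by: {a}" for a in others)
    # Trailers form the last paragraph, separated from the body by a blank line.
    return primary, message.rstrip("\n") + "\n\n" + trailers + "\n"

# Placeholder identities and log text.
author, text = add_coauthors(
    "Fix placeholder bug.\n",
    ["A <a@example.org>", "B <b@example.org>", "C <c@example.org>"],
)
print(author)
print(text, end="")
```

A single-author commit passes through unchanged, so the same code path
can serve every commit.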

> 3e) r155891, r207422, r183627 and r234218 are examples of commits which don't
> touch any ChangeLog files for different reasons. Seems unsolvable in current
> approach.

If a ChangeLog file isn't touched, indeed we don't have a good basis for 
using an author identity other than the committer identity (especially 
given that some people used multiple email addresses simultaneously, with 
different ones used for different kinds of commits, and objected to having 
one with the wrong affiliation associated with a commit they made in 
connection with a different affiliation).

-- 
Joseph S. Myers
jsm@polyomino.org.uk

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: Test GCC conversion with reposurgeon available
  2019-12-24 18:14               ` Segher Boessenkool
@ 2019-12-25 11:03                 ` Roman Zhuykov
  2019-12-25 11:20                   ` Joseph Myers
                                     ` (2 more replies)
  0 siblings, 3 replies; 54+ messages in thread
From: Roman Zhuykov @ 2019-12-25 11:03 UTC (permalink / raw)
  To: Segher Boessenkool, Joseph Myers
  Cc: gcc, Alexander Monakov, Maxim Kuvyrkov, esr

First of all thanks to everyone who spent time making the conversion 
better and better. Here is my 2c: I have studied a little of my colleagues' 
trunk history in Maxim's gcc-pretty vs gcc-reposurgeon-5b.

1) In gcc-pretty timezone info is lost in both author/committer date 
(keeping UTC time correct, certainly). Examples are r278990 and r289989.
Probably git-svn causes this, current read-only git mirror is also 
without timezone. Not sure we need that info, but reposurgeon is more 
correct here.

2) Some thoughts about script for summarizing commit log messages:
2a) Why do r143753 and r150680 not have a "re PR..." summary instead of 
"[multiple changes]" ?
2b) On the contrary r155892 has to mention two PRs, even "[multiple 
changes]" is better here, IMHO.
2c) In r130050 and r155902 we have "Rename too ... " in summary, not 
sure how to make it better.
2d) r146882 can have a better summary if we somehow organize ChangeLog 
priority (gcc/ChangeLog is more important than the testsuite one).

3) About author emails, see below
24.12.2019 21:14, Segher Boessenkool wrote:
> On Tue, Dec 24, 2019 at 05:16:54PM +0000, Joseph Myers wrote:
>> On Tue, 24 Dec 2019, Segher Boessenkool wrote:
>>>> That's because that commit also edits ChangeLog entries from other
>>>> authors.  When a commit adds / edits ChangeLog entries for more than one
>>>> author (the difference between purely editing an existing entry and adding
>>>> a new one, possibly under an existing date/author header, for a
>>>> multi-author commit, is not something that can reliably be determined
>>>> automatically), the conversion falls back to using the committer identity
>>>> instead of picking one of the multiple relevant authors from the ChangeLog
>>>> files.
>>> There is only one relevant author in r270511.  It edits a few wrong path
>>> names in the previous changelog entries.  People often do similar things
>>> (like fixing the commit date :-) )
>> Distinguishing "edits a previous ChangeLog entry" from "adds a new entry
>> under a previous ChangeLog header for a change included in the commit" is
>> a human judgement.
> We are doing only one conversion here, the one of the GCC repo.  The
> heuristic works, we checked it did.
>
>>> Either never use <account>@gcc.gnu.org, or always use it, don't do the
>>> worst of both worlds?
>> The heuristics here are to use an attribution from ChangeLog for the
>> author where unambiguous, but to use the committer (always @gcc.gnu.org /
>> @gnu.org [*], so avoiding attributions at the wrong company even where
>> people were using multiple addresses simultaneously for different changes)
>> as author if in doubt.
> You never need that, and it is worse to use two different schemes than to
> choose either.
>
> I would have chosen the "<account>@gcc.gnu.org" scheme, because it is
> simple and *correct*.  Other people wanted the nicer names.  Maxim's
> conversion gets that correct.  Please copy it.
>
IMHO Segher is a bit categorical in the discussion, but I'll be glad to 
see a brief description of Maxim's approach to managing emails; gcc-pretty 
shows better results.
Speaking about the script counting authors from ChangeLog files, even if 
we drop the "edits a previous ChangeLog entry" issue, it still sometimes 
does not work as Joseph described:
3a) In r155892, r155893 and r259314 Alex is not counted as the only 
author without any reason.
3b) In r139854, r141108 and r196252 the script selected the author 
successfully, while actually there is more than one.
3c) Maybe here we can also somehow organize ChangeLog priority (again, 
gcc/ChangeLog is more important than the testsuite one). There are a lot of 
examples where the testsuite/ChangeLog entry has another author: r145055, 
r150680, r155889, r155894, r155890, r163904, r180186 and r183325.
3d) If we fix 3b+3c we can also look at r143753, r155890 and r155895.
3e) r155891, r207422, r183627 and r234218 are examples of commits which 
don't touch any ChangeLog files for different reasons. Seems unsolvable 
in current approach.

--
Roman

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: Test GCC conversion with reposurgeon available
  2019-12-24 17:17             ` Joseph Myers
@ 2019-12-24 18:14               ` Segher Boessenkool
  2019-12-25 11:03                 ` Roman Zhuykov
  0 siblings, 1 reply; 54+ messages in thread
From: Segher Boessenkool @ 2019-12-24 18:14 UTC (permalink / raw)
  To: Joseph Myers; +Cc: Roman Zhuykov, gcc, esr

On Tue, Dec 24, 2019 at 05:16:54PM +0000, Joseph Myers wrote:
> On Tue, 24 Dec 2019, Segher Boessenkool wrote:
> > > That's because that commit also edits ChangeLog entries from other 
> > > authors.  When a commit adds / edits ChangeLog entries for more than one 
> > > author (the difference between purely editing an existing entry and adding 
> > > a new one, possibly under an existing date/author header, for a 
> > > multi-author commit, is not something that can reliably be determined 
> > > automatically), the conversion falls back to using the committer identity 
> > > instead of picking one of the multiple relevant authors from the ChangeLog 
> > > files.
> > 
> > There is only one relevant author in r270511.  It edits a few wrong path
> > names in the previous changelog entries.  People often do similar things
> > (like fixing the commit date :-) )
> 
> Distinguishing "edits a previous ChangeLog entry" from "adds a new entry 
> under a previous ChangeLog header for a change included in the commit" is 
> a human judgement.

We are doing only one conversion here, the one of the GCC repo.  The
heuristic works, we checked it did.

> > Either never use <account>@gcc.gnu.org, or always use it, don't do the
> > worst of both worlds?
> 
> The heuristics here are to use an attribution from ChangeLog for the 
> author where unambiguous, but to use the committer (always @gcc.gnu.org / 
> @gnu.org [*], so avoiding attributions at the wrong company even where 
> people were using multiple addresses simultaneously for different changes) 
> as author if in doubt.

You never need that, and it is worse to use two different schemes than to
choose either.

I would have chosen the "<account>@gcc.gnu.org" scheme, because it is
simple and *correct*.  Other people wanted the nicer names.  Maxim's
conversion gets that correct.  Please copy it.

If your tool isn't sure what to do, use human intervention.  For example,
make up a heuristic, and check that exhaustively.  We have only one repo
to convert!

And people do *not* have the same email address for the whole lifetime
of the repo.  This would mean I can never again contribute to GCC if I
start using a different email address after the conversion!


Segher


* Re: Test GCC conversion with reposurgeon available
  2019-12-24 15:55           ` Segher Boessenkool
@ 2019-12-24 17:17             ` Joseph Myers
  2019-12-24 18:14               ` Segher Boessenkool
  0 siblings, 1 reply; 54+ messages in thread
From: Joseph Myers @ 2019-12-24 17:17 UTC (permalink / raw)
  To: Segher Boessenkool; +Cc: Roman Zhuykov, gcc, esr

On Tue, 24 Dec 2019, Segher Boessenkool wrote:

> > That's because that commit also edits ChangeLog entries from other 
> > authors.  When a commit adds / edits ChangeLog entries for more than one 
> > author (the difference between purely editing an existing entry and adding 
> > a new one, possibly under an existing date/author header, for a 
> > multi-author commit, is not something that can reliably be determined 
> > automatically), the conversion falls back to using the committer identity 
> > instead of picking one of the multiple relevant authors from the ChangeLog 
> > files.
> 
> There is only one relevant author in r270511.  It edits a few wrong path
> names in the previous changelog entries.  People often do similar things
> (like fixing the commit date :-) )

Distinguishing "edits a previous ChangeLog entry" from "adds a new entry 
under a previous ChangeLog header for a change included in the commit" is 
a human judgement.  (It's necessary to consider the case of a ChangeLog 
header not included in the new lines added by the commit when looking for 
an attribution in a ChangeLog file because multiple ChangeLog stanzas, 
separated by blank lines, under a single date/author header, is common 
usage for multiple consecutive commits by the same author, and sometimes 
non-consecutive commits depending on how they did merging.)
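
To make the header/stanza point concrete, here is a minimal Python sketch of
attributing added ChangeLog lines to the nearest date/author header above
them, whether or not the commit itself added that header. The header regular
expression, the function name and the sample data are assumptions made for
illustration; this is not reposurgeon's actual code.

```python
import re

# Assumed GNU ChangeLog header shape: "YYYY-MM-DD  Name  <email>".
HEADER = re.compile(r"^(\d{4}-\d{2}-\d{2})\s+(.+?)\s+<([^>]+)>\s*$")


def authors_for_added_lines(changelog_lines, added_line_numbers):
    """Return the set of (name, email) headers governing the added lines.

    changelog_lines: full post-commit file contents, as a list of strings.
    added_line_numbers: 0-based indexes of the lines the commit added.
    """
    authors = set()
    current = None
    added = set(added_line_numbers)
    for index, line in enumerate(changelog_lines):
        match = HEADER.match(line)
        if match:
            current = (match.group(2), match.group(3))
        # Attribute every added non-blank line to the nearest header above
        # it, even when that header was not itself added by this commit.
        if index in added and line.strip() and current is not None:
            authors.add(current)
    return authors
```

A result with more than one element is the ambiguous case: the sketch cannot
tell an edit of someone else's old entry from a new entry added under their
header, which is the judgement described above.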

> Either never use <account>@gcc.gnu.org, or always use it, don't do the
> worst of both worlds?

The heuristics here are to use an attribution from ChangeLog for the 
author where unambiguous, but to use the committer (always @gcc.gnu.org / 
@gnu.org [*], so avoiding attributions at the wrong company even where 
people were using multiple addresses simultaneously for different changes) 
as author if in doubt.  I think that's better than either always or never 
using @gcc.gnu.org.  git tools computing commit statistics can generally 
handle multiple email addresses with the same name automatically, and 
.mailmap can be used to specify more detailed mappings for past commits 
for at least git shortlog.

[*] Except in the special case of "rolfh" from the gcc2 history, where we 
identified the author of the changes they committed but not who that user 
was as a committer, so gcc.map specifies that author identity.
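
The fallback rule described above can be sketched in a few lines of Python.
The function name and the tuple representation are assumptions for
illustration only; treating "no ChangeLog attribution at all" the same as
"several candidates" is also an assumption of this sketch.

```python
def choose_author(changelog_authors, committer):
    """Pick the git author for a converted commit.

    changelog_authors: iterable of (name, email) pairs found in the ChangeLog
    entries the commit adds or edits.
    committer: (name, email) pair, normally an @gcc.gnu.org / @gnu.org address.
    """
    distinct = set(changelog_authors)
    if len(distinct) == 1:
        # Unambiguous: exactly one author is credited in the ChangeLog.
        return next(iter(distinct))
    # No attribution, or several candidates: fall back to the committer.
    return committer
```

With this rule a commit that touches entries under two different headers
gets the committer identity as author, which is what happened to r270511.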

-- 
Joseph S. Myers
jsm@polyomino.org.uk


* Re: Test GCC conversion with reposurgeon available
  2019-12-24 11:50         ` Joseph Myers
@ 2019-12-24 15:55           ` Segher Boessenkool
  2019-12-24 17:17             ` Joseph Myers
  0 siblings, 1 reply; 54+ messages in thread
From: Segher Boessenkool @ 2019-12-24 15:55 UTC (permalink / raw)
  To: Joseph Myers; +Cc: Roman Zhuykov, gcc, esr

On Tue, Dec 24, 2019 at 11:50:30AM +0000, Joseph Myers wrote:
> On Mon, 23 Dec 2019, Roman Zhuykov wrote:
> > I've never used zhroma@gcc.gnu.org email in ChangeLog files. So, it seems odd
> > that it is used in r270511 (my first commit as maintainer), but not in next
> 
> That's because that commit also edits ChangeLog entries from other 
> authors.  When a commit adds / edits ChangeLog entries for more than one 
> author (the difference between purely editing an existing entry and adding 
> a new one, possibly under an existing date/author header, for a 
> multi-author commit, is not something that can reliably be determined 
> automatically), the conversion falls back to using the committer identity 
> instead of picking one of the multiple relevant authors from the ChangeLog 
> files.

There is only one relevant author in r270511.  It edits a few wrong path
names in the previous changelog entries.  People often do similar things
(like fixing the commit date :-) )

Either never use <account>@gcc.gnu.org, or always use it, don't do the
worst of both worlds?


Maxim's conversion has this just fine: (from gcc-reparent):

commit 6d4633c4d15a92b88332c1e0cbc7f5c1c93c1a8a
Author:     Roman Zhuykov <zhroma@ispras.ru>
AuthorDate: Tue Apr 23 12:53:43 2019 +0000
Commit:     Roman Zhuykov <zhroma@ispras.ru>
CommitDate: Tue Apr 23 12:53:43 2019 +0000

    modulo-sched: fix branch scheduling issue (PR84032)
    
        PR rtl-optimization/84032
        * modulo-sched.c (ps_insn_find_column): Change condition so that
        branch will always be the last insn in a row inside partial
        schedule.
    
    testsuite:
    
        PR rtl-optimization/84032
        * gcc.dg/pr84032.c: New test.
    
    
    git-svn-id: https://gcc.gnu.org/svn/gcc/trunk@270511 138bc75d-0d04-0410-961f

(It also gets different author and committer right, and changing email
addresses over time).


Segher


* Re: Test GCC conversion with reposurgeon available
  2019-12-23 17:27       ` Roman Zhuykov
@ 2019-12-24 11:50         ` Joseph Myers
  2019-12-24 15:55           ` Segher Boessenkool
  0 siblings, 1 reply; 54+ messages in thread
From: Joseph Myers @ 2019-12-24 11:50 UTC (permalink / raw)
  To: Roman Zhuykov; +Cc: gcc, esr

On Mon, 23 Dec 2019, Roman Zhuykov wrote:

> I've never used zhroma@gcc.gnu.org email in ChangeLog files. So, it seems odd
> that it is used in r270511 (my first commit as maintainer), but not in next

That's because that commit also edits ChangeLog entries from other 
authors.  When a commit adds / edits ChangeLog entries for more than one 
author (the difference between purely editing an existing entry and adding 
a new one, possibly under an existing date/author header, for a 
multi-author commit, is not something that can reliably be determined 
automatically), the conversion falls back to using the committer identity 
instead of picking one of the multiple relevant authors from the ChangeLog 
files.

-- 
Joseph S. Myers
jsm@polyomino.org.uk


* Re: Test GCC conversion with reposurgeon available
  2019-12-22 13:57     ` Joseph Myers
  2019-12-23 17:27       ` Roman Zhuykov
@ 2019-12-24 10:57       ` Maxim Kuvyrkov
  2019-12-28 16:30       ` Joseph Myers
  2 siblings, 0 replies; 54+ messages in thread
From: Maxim Kuvyrkov @ 2019-12-24 10:57 UTC (permalink / raw)
  To: Joseph S. Myers; +Cc: gcc, esr

> On Dec 22, 2019, at 4:56 PM, Joseph Myers <joseph@codesourcery.com> wrote:
> 
> On Thu, 19 Dec 2019, Joseph Myers wrote:
> 
>> And two more.
>> 
>> git+ssh://gcc.gnu.org/home/gccadmin/gcc-reposurgeon-4a.git
>> git+ssh://gcc.gnu.org/home/gccadmin/gcc-reposurgeon-4b.git
> 
> Two more.
> 
> git+ssh://gcc.gnu.org/home/gccadmin/gcc-reposurgeon-5a.git
> git+ssh://gcc.gnu.org/home/gccadmin/gcc-reposurgeon-5b.git
> 
> The main changes are:
> 
> * The case of both svnmerge-integrated and svn:mergeinfo being set is now 
> handled properly, so the commit Bernd found is interpreted as a merge from 
> trunk to named-addr-spaces-branch and has exactly two parents as expected, 
> with the parents corresponding to the merges from other branches to trunk 
> being optimized away.
> 
> * The author map used now avoids timezone-only entries also remapping 
> email addresses, so the email addresses from the ChangeLogs are used 
> whenever a commit adds ChangeLog entries from exactly one author.
> 
> * When commits add ChangeLog entries from more than one author (e.g. 
> merges done in CVS), the committer is now used as the author rather than 
> selecting one of the authors from the ChangeLog entries.
> 
> * The latest whitelisting / PR corrections are used with Richard's script 
> (430 checkme: entries remain).
> 
> * One fix to the ref renaming in gcc-reposurgeon-5b.git so that the tag 
> gcc-3_2-rhl8-3_2-7 properly ends up in vendors rather than prereleases.

I'll spend the next couple of days comparing Joseph's gcc-reposurgeon-5a.git conversion against my gcc-pretty.git and gcc-reparent.git conversions, and will post the results along with the scripts to this mailing list.

Regarding gcc-pretty.git and gcc-reparent.git conversions, I have the following comments so far:

Q1: Why are there missing branches for stuff that didn't originate at trunk@1?
A1: Indeed, that's by design / configuration.  The scripts start with trunk@1 and build a parent DAG from that node.  If desired, it is trivial to add more initial "root" commits to include these missing branches.

Q2: Why are entries from branches/st/tags treated as branches, not as tags?
A2: Because I opted to not special-case these to simplify comparison of different conversions.  Tags/* entries are converted to git annotated tags in a separate pass, and it is trivial to add handling for branches/st/tags there.

Q3: Why do reparented branches in gcc-reparent.git repo have merge commits at the point of reparenting?
A3: That's an artifact of svn-git machinery my scripts are using.  I haven't looked at this in depth.

Q4: Is it possible to integrate Richard E.'s script to rewrite commit log messages?
A4: Yes, absolutely.  The scripts have a pass to rewrite commit author/committer entries, and log rewrite easily fits in there.  It would be very helpful to have a version of Richard's script that runs on a per-commit basis, suitable for "git filter-branch" consumption.
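
As an illustration of A1, here is a small Python sketch of why branches that
do not descend from the chosen root are missing, and why adding more roots
brings them in. It works at branch level with a hypothetical "copied from"
map, whereas the real scripts work on commits starting at trunk@1; all names
here are assumptions.

```python
from collections import deque


def reachable_branches(roots, children_of):
    """Branches reachable from `roots` via {branch: [branches copied from it]}."""
    seen = set(roots)
    queue = deque(roots)
    while queue:
        branch = queue.popleft()
        for child in children_of.get(branch, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen
```

Passing an extra root in `roots` is the sketch's equivalent of adding more
initial "root" commits to the configuration.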

Regards,

--
Maxim Kuvyrkov
https://www.linaro.org




* Re: Test GCC conversion with reposurgeon available
  2019-12-22 13:57     ` Joseph Myers
@ 2019-12-23 17:27       ` Roman Zhuykov
  2019-12-24 11:50         ` Joseph Myers
  2019-12-24 10:57       ` Maxim Kuvyrkov
  2019-12-28 16:30       ` Joseph Myers
  2 siblings, 1 reply; 54+ messages in thread
From: Roman Zhuykov @ 2019-12-23 17:27 UTC (permalink / raw)
  To: Joseph Myers, gcc; +Cc: esr

22.12.2019 16:56, Joseph Myers wrote:
> On Thu, 19 Dec 2019, Joseph Myers wrote:
>
>> And two more.
>>
>> git+ssh://gcc.gnu.org/home/gccadmin/gcc-reposurgeon-4a.git
>> git+ssh://gcc.gnu.org/home/gccadmin/gcc-reposurgeon-4b.git
> Two more.
>
> git+ssh://gcc.gnu.org/home/gccadmin/gcc-reposurgeon-5a.git
> git+ssh://gcc.gnu.org/home/gccadmin/gcc-reposurgeon-5b.git
>
> The main changes are:
>
> * The author map used now avoids timezone-only entries also remapping
> email addresses, so the email addresses from the ChangeLogs are used
> whenever a commit adds ChangeLog entries from exactly one author.
Hello.

Sorry if I have missed part of the discussion and am describing a known 
issue with the "author map", but in gcc-reposurgeon-5b.git I see this:

gcc-reposurgeon-5b.git$ git log --pretty=format:"%h%x09 %ae%x09%ad" 
--author=zhroma

ff96b83  zhroma@ispras.ru       Fri Dec 20 15:40:46 2019 +0000
17a6791  zhroma@ispras.ru       Fri Dec 13 17:33:38 2019 +0000
86f05d3  zhroma@ispras.ru       Fri Dec 13 17:17:31 2019 +0000
7103c52  zhroma@ispras.ru       Fri Dec 13 17:02:53 2019 +0000
1e035ae  zhroma@ispras.ru       Tue Apr 23 13:14:57 2019 +0000
cea94d2  zhroma@gcc.gnu.org     Tue Apr 23 12:53:43 2019 +0000
28635d5  zhroma@ispras.ru       Mon Apr 22 16:05:36 2019 +0000
2f59c1e  zhroma@ispras.ru       Fri Mar 29 12:44:01 2019 -0600
1cc4c09  zhroma@ispras.ru       Fri Feb 10 16:00:30 2012 +0400
9839b87  zhroma@ispras.ru       Mon Jul 25 13:43:01 2011 +0400

gcc-reposurgeon-5b.git$ git log --pretty=format:"%h%x09 %ae%x09%ad" 
--author=zhroma releases/gcc-8
40cc006  zhroma@ispras.ru       Fri Dec 20 15:52:02 2019 +0000
4cf66d9  zhroma@ispras.ru       Fri Dec 20 15:07:58 2019 +0000
9240950  zhroma@gcc.gnu.org     Fri Apr 26 16:04:54 2019 +0000
1cc4c09  zhroma@ispras.ru       Fri Feb 10 16:00:30 2012 +0400
9839b87  zhroma@ispras.ru       Mon Jul 25 13:43:01 2011 +0400

I've never used zhroma@gcc.gnu.org email in ChangeLog files. So, it 
seems odd that it is used in r270511 (my first commit as maintainer), 
but not in the next commit, r270512, or in later ones. Moreover, it is also 
used once in r270609 on the 8 branch. I think it's better to use zhroma@ispras.ru 
everywhere. Another option may be using @gcc.gnu.org everywhere since 
r270511.
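
A quick way to spot such outliers is to tally the email column of the log
output. The Python sketch below assumes the tab-separated layout produced by
the --pretty format shown above; the function name is hypothetical.

```python
from collections import Counter


def count_author_emails(log_output):
    """Count author emails in `git log` output of the form used above.

    Each non-empty line is expected to be: hash, TAB, email, TAB, date.
    """
    counts = Counter()
    for line in log_output.splitlines():
        fields = [field.strip() for field in line.split("\t")]
        if len(fields) >= 2 and fields[1]:
            counts[fields[1]] += 1
    return counts
```

Any address with a much lower count than the others is a candidate for a
closer look at the corresponding commit.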

I also see this discussion 
https://gcc.gnu.org/ml/gcc/2019-09/msg00216.html, but it was about 
wwwdocs repo.

The committer field is correct in the repo: 
"Roman Zhuykov <zhroma@gcc.gnu.org>" is used everywhere.

--

Roman


* Re: Test GCC conversion with reposurgeon available
  2019-12-19 16:29   ` Joseph Myers
@ 2019-12-22 13:57     ` Joseph Myers
  2019-12-23 17:27       ` Roman Zhuykov
                         ` (2 more replies)
  0 siblings, 3 replies; 54+ messages in thread
From: Joseph Myers @ 2019-12-22 13:57 UTC (permalink / raw)
  To: gcc; +Cc: esr

On Thu, 19 Dec 2019, Joseph Myers wrote:

> And two more.
> 
> git+ssh://gcc.gnu.org/home/gccadmin/gcc-reposurgeon-4a.git
> git+ssh://gcc.gnu.org/home/gccadmin/gcc-reposurgeon-4b.git

Two more.

git+ssh://gcc.gnu.org/home/gccadmin/gcc-reposurgeon-5a.git
git+ssh://gcc.gnu.org/home/gccadmin/gcc-reposurgeon-5b.git

The main changes are:

* The case of both svnmerge-integrated and svn:mergeinfo being set is now 
handled properly, so the commit Bernd found is interpreted as a merge from 
trunk to named-addr-spaces-branch and has exactly two parents as expected, 
with the parents corresponding to the merges from other branches to trunk 
being optimized away.

* The author map used now avoids timezone-only entries also remapping 
email addresses, so the email addresses from the ChangeLogs are used 
whenever a commit adds ChangeLog entries from exactly one author.

* When commits add ChangeLog entries from more than one author (e.g. 
merges done in CVS), the committer is now used as the author rather than 
selecting one of the authors from the ChangeLog entries.

* The latest whitelisting / PR corrections are used with Richard's script 
(430 checkme: entries remain).

* One fix to the ref renaming in gcc-reposurgeon-5b.git so that the tag 
gcc-3_2-rhl8-3_2-7 properly ends up in vendors rather than prereleases.

-- 
Joseph S. Myers
joseph@codesourcery.com


* Re: Test GCC conversion with reposurgeon available
  2019-12-18 21:55 ` Joseph Myers
  2019-12-19  0:36   ` Bernd Schmidt
@ 2019-12-19 16:29   ` Joseph Myers
  2019-12-22 13:57     ` Joseph Myers
  1 sibling, 1 reply; 54+ messages in thread
From: Joseph Myers @ 2019-12-19 16:29 UTC (permalink / raw)
  To: gcc; +Cc: esr

On Wed, 18 Dec 2019, Joseph Myers wrote:

> There are now four more repositories available.
> 
> git+ssh://gcc.gnu.org/home/gccadmin/gcc-reposurgeon-2a.git
> git+ssh://gcc.gnu.org/home/gccadmin/gcc-reposurgeon-2b.git
> git+ssh://gcc.gnu.org/home/gccadmin/gcc-reposurgeon-3a.git
> git+ssh://gcc.gnu.org/home/gccadmin/gcc-reposurgeon-3b.git

And two more.

git+ssh://gcc.gnu.org/home/gccadmin/gcc-reposurgeon-4a.git
git+ssh://gcc.gnu.org/home/gccadmin/gcc-reposurgeon-4b.git

The main changes in this version are:

* More mergeinfo improvements, so more valid merges should be detected and 
represented as such in git.  (This doesn't yet have the changes to handle 
the case of both svnmerge-integrated and svn:mergeinfo having relevant 
merge information.)

* More bug data used with Richard's script (but not the most recent 
whitelisting / PR corrections).

* @gcc.gnu.org / @gnu.org addresses are preferred in the author map, to 
avoid anachronistic credits of commits to addresses at the wrong company 
(this also means that the git "committer" information consistently uses 
such addresses, which is certainly logically correct).  I preserved 
timezone annotations in the map for existing addresses but accidentally 
did that in a way that resulted in those existing addresses, when used in 
ChangeLog entries, being mapped to the @gcc.gnu.org / @gnu.org ones; I'll 
fix that for the next conversion run so the addresses from the ChangeLog 
entries are properly preserved in such cases but still use the timezone 
annotations from gcc.map.

-- 
Joseph S. Myers
joseph@codesourcery.com


* Re: Test GCC conversion with reposurgeon available
  2019-12-19  5:50     ` Jason Merrill
@ 2019-12-19 15:55       ` Joseph Myers
  0 siblings, 0 replies; 54+ messages in thread
From: Joseph Myers @ 2019-12-19 15:55 UTC (permalink / raw)
  To: Jason Merrill; +Cc: gcc Mailing List, Eric Raymond

On Thu, 19 Dec 2019, Jason Merrill wrote:

> So a 30% space savings; that's pretty significant.  Though I wonder how
> much of that is refs/dead and refs/deleted, which seem unnecessary to carry
> over to git at all.  I wonder if it would make sense to put them in a
> separate repository that refers to the main gcc.git?

refs/dead is definitely relevant sometimes; that's old development 
branches.  refs/deleted is less clearly relevant.

-- 
Joseph S. Myers
joseph@codesourcery.com


* Re: Test GCC conversion with reposurgeon available
  2019-12-18 18:16   ` Joseph Myers
@ 2019-12-19  5:50     ` Jason Merrill
  2019-12-19 15:55       ` Joseph Myers
  0 siblings, 1 reply; 54+ messages in thread
From: Jason Merrill @ 2019-12-19  5:50 UTC (permalink / raw)
  To: Joseph Myers; +Cc: gcc Mailing List, Eric Raymond

On Wed, Dec 18, 2019 at 1:17 PM Joseph Myers <joseph@codesourcery.com>
wrote:

> On Wed, 18 Dec 2019, Jason Merrill wrote:
>
> > On Tue, Dec 17, 2019 at 4:39 PM Joseph Myers <joseph@codesourcery.com>
> > wrote:
> >
> > > Points for consideration:
> > >
> > > 1. Do we want some kind of rearrangement of refs as in the 1b
> > > repository or not?
> > >
> >
> > Maybe?  How much space does that save in a clone?  How much work does a
> > partial clone add on the server, since the server needs to pack up the
> > objects for the partial clone rather than just transmitting its own
> packs?
>
> I haven't measured work on the server, and timing individual clones is
> liable to a lot of variation from variable load there, but for a single
> clone --mirror of the 1b repository (so all refs, including refs/deleted/)
> I got
>
> real    13m16.473s
> user    16m45.429s
> sys     0m33.901s
>
> and 1360 MB objects directory, but for a clone without --mirror (so only a
> limited subset of refs and the server needing to build a pack)
>
> real    15m5.554s
> user    12m11.771s
> sys     0m26.914s
>
> and 950 MB objects directory.  Adding the objects from the existing
> git-svn mirror (presumably also under refs not fetched by default)
> increases repository size by about 300 MB, based on a previous test of
> doing that (most blob and tree objects will be shared between the two
> versions of the history, but all the commit objects are separate).
>

So a 30% space savings; that's pretty significant.  Though I wonder how
much of that is refs/dead and refs/deleted, which seem unnecessary to carry
over to git at all.  I wonder if it would make sense to put them in a
separate repository that refers to the main gcc.git?
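
For reference, the 30% figure follows directly from the two objects
directory sizes quoted above; a short Python check (the function name is
only illustrative):

```python
def percent_saved(full_mb, reduced_mb):
    """Percentage by which reduced_mb is smaller than full_mb."""
    return 100.0 * (full_mb - reduced_mb) / full_mb


# Sizes quoted above: 1360 MB with --mirror, 950 MB without.
saving = percent_saved(1360, 950)
print(f"{saving:.1f}%")  # 30.1%
```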

Jason


* Re: Test GCC conversion with reposurgeon available
  2019-12-19  0:36   ` Bernd Schmidt
@ 2019-12-19  0:58     ` Joseph Myers
  0 siblings, 0 replies; 54+ messages in thread
From: Joseph Myers @ 2019-12-19  0:58 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: gcc, esr

On Thu, 19 Dec 2019, Bernd Schmidt wrote:

> On 12/18/19 10:55 PM, Joseph Myers wrote:
> > git+ssh://gcc.gnu.org/home/gccadmin/gcc-reposurgeon-3a.git
> 
> I cloned this one and started trying random things again.
> The previous one had some strange-looking merge commits, but it sounded like
> that was a known issue, and indeed the ones I had seen were fixed in this new
> version.
> I decided to write a small script to check whether there were any merge
> commits with more than two parents, and there are a few which have three. One
> commit seems to occur as a parent in all of these:
> 422854db0e8605867e0834035aa2b1da1b71cbfb. An example is
> b743e467e43e6211f2c2537f1f07bbceb4d3aa61, apparently from spu-4_5-branch.
> 
> No idea whether there is an issue or whether this is worth looking at, but I
> figured I'd point it out at least.

b743e467e43e6211f2c2537f1f07bbceb4d3aa61 is r152464, from 
named-addr-spaces-branch.

This merge is the first one that added an svn:mergeinfo property on 
named-addr-spaces-branch, which previously had only svnmerge-integrated 
(the svn:mergeinfo property was copied from the one that was on trunk at 
that time).  That svn:mergeinfo property specifies merges from 
/branches/cxx0x-lambdas-branch, /branches/lto and /trunk, and the merge 
parents are for the first two of those.  For /trunk, it only specifies two 
revisions, which is clearly not a valid merge and this version of 
reposurgeon avoids creating merge commits for cherry-picks.  
svnmerge-integrated specifies /trunk:1-151687,151691-152437, which would 
be a valid merge from trunk (there are no trunk revisions in the gap) but 
reposurgeon only looks at svnmerge-integrated if svn:mergeinfo is empty.  
I'll file a request that it take the union of revision ranges from the two 
properties, which ought to be easy to implement.  Once it recognises that 
commit as a merge from trunk, it should automatically discard the 
other merge parents because it now avoids adding a merge parent that is an 
ancestor of another merge parent.
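
The proposed union of revision ranges can be sketched as follows in Python.
This is not reposurgeon code: the function names are hypothetical, and the
parser assumes plain "first-last" or single-revision items only, ignoring
any other syntax a merge property might contain.

```python
def parse_ranges(spec):
    """Parse "1-151687,151691-152437" into a list of (first, last) tuples."""
    ranges = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        first, _, last = part.partition("-")
        ranges.append((int(first), int(last or first)))
    return ranges


def union_ranges(*range_lists):
    """Merge overlapping or adjacent revision ranges from several properties."""
    merged = []
    for first, last in sorted(r for ranges in range_lists for r in ranges):
        if merged and first <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], last))
        else:
            merged.append((first, last))
    return merged


def covers_branch_history(ranges, source_revisions, up_to):
    """True if every source-branch revision up to `up_to` lies in some range."""
    return all(
        any(first <= rev <= last for first, last in ranges)
        for rev in source_revisions
        if rev <= up_to
    )
```

A gap between ranges is harmless as long as no revision of the source branch
falls inside it, which is the situation described for /trunk above.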

I.e. this is an artifact of someone having done a merge with svnmerge.py 
that brought in svn:mergeinfo from another branch where SVN's native merge 
tracking had been used, and should be straightforward to fix.
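
The rule of not adding a merge parent that is an ancestor of another merge
parent can be illustrated with a small Python sketch over a toy commit
graph. The function names and the dictionary representation are assumptions
for illustration, not reposurgeon's implementation.

```python
def ancestors(commit, parents_of):
    """All strict ancestors of `commit` in a DAG given as {commit: [parents]}."""
    seen = set()
    stack = list(parents_of.get(commit, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents_of.get(node, []))
    return seen


def prune_redundant_parents(candidates, parents_of):
    """Drop any candidate merge parent that is an ancestor of another one."""
    kept = []
    for candidate in candidates:
        others = [c for c in candidates if c != candidate]
        if not any(candidate in ancestors(other, parents_of) for other in others):
            kept.append(candidate)
    return kept
```

Once trunk is accepted as a merge parent, branch tips that were already
merged into trunk are its ancestors and drop out, leaving two parents.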

-- 
Joseph S. Myers
joseph@codesourcery.com


* Re: Test GCC conversion with reposurgeon available
  2019-12-18 21:55 ` Joseph Myers
@ 2019-12-19  0:36   ` Bernd Schmidt
  2019-12-19  0:58     ` Joseph Myers
  2019-12-19 16:29   ` Joseph Myers
  1 sibling, 1 reply; 54+ messages in thread
From: Bernd Schmidt @ 2019-12-19  0:36 UTC (permalink / raw)
  To: Joseph Myers, gcc; +Cc: esr

On 12/18/19 10:55 PM, Joseph Myers wrote:
> git+ssh://gcc.gnu.org/home/gccadmin/gcc-reposurgeon-3a.git

I cloned this one and started trying random things again.
The previous one had some strange-looking merge commits, but it sounded 
like that was a known issue, and indeed the ones I had seen were fixed 
in this new version.
I decided to write a small script to check whether there were any merge 
commits with more than two parents, and there are a few which have 
three. One commit seems to occur as a parent in all of these: 
422854db0e8605867e0834035aa2b1da1b71cbfb. An example is 
b743e467e43e6211f2c2537f1f07bbceb4d3aa61, apparently from spu-4_5-branch.

No idea whether there is an issue or whether this is worth looking at, 
but I figured I'd point it out at least.
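
For anyone wanting to repeat this check, a minimal Python sketch follows. It
is not the script mentioned above; it assumes input in the format printed by
"git rev-list --parents --all" (one line per commit: the commit hash
followed by its parent hashes), and the function name is hypothetical.

```python
def commits_with_many_parents(rev_list_output, max_parents=2):
    """Find commits with more than `max_parents` parents.

    rev_list_output: text in the format of `git rev-list --parents --all`,
    one line per commit: the commit hash followed by its parent hashes.
    """
    found = {}
    for line in rev_list_output.splitlines():
        hashes = line.split()
        if len(hashes) - 1 > max_parents:
            found[hashes[0]] = hashes[1:]
    return found
```

The input text can be captured with subprocess.run(["git", "rev-list",
"--parents", "--all"], capture_output=True, text=True, check=True).stdout
inside a clone.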


Bernd


* Re: Test GCC conversion with reposurgeon available
  2019-12-17 21:32 Joseph Myers
  2019-12-17 23:33 ` Bernd Schmidt
  2019-12-18 13:10 ` Jason Merrill
@ 2019-12-18 21:55 ` Joseph Myers
  2019-12-19  0:36   ` Bernd Schmidt
  2019-12-19 16:29   ` Joseph Myers
  2 siblings, 2 replies; 54+ messages in thread
From: Joseph Myers @ 2019-12-18 21:55 UTC (permalink / raw)
  To: gcc; +Cc: esr

On Tue, 17 Dec 2019, Joseph Myers wrote:

> I've made test conversions of the GCC repository with reposurgeon
> available (gcc.gnu.org / sourceware.org account required to access
> these git+ssh repositories, it doesn't need to be one in the gcc group
> or to have shell access).  More information about the repositories,
> conversion choices made and known issues is given below, and, as noted
> there, I'm running another conversion now with fixes for some of those
> issues and the remaining listed issues not fixed in that conversion
> are being actively worked on.
> 
> git+ssh://gcc.gnu.org/home/gccadmin/gcc-reposurgeon-1a.git
> git+ssh://gcc.gnu.org/home/gccadmin/gcc-reposurgeon-1b.git

There are now four more repositories available.

git+ssh://gcc.gnu.org/home/gccadmin/gcc-reposurgeon-2a.git
git+ssh://gcc.gnu.org/home/gccadmin/gcc-reposurgeon-2b.git
git+ssh://gcc.gnu.org/home/gccadmin/gcc-reposurgeon-3a.git
git+ssh://gcc.gnu.org/home/gccadmin/gcc-reposurgeon-3b.git

The 2a and 2b repositories are similar to 1a and 1b, but have fixes to 
issues 4 and 5 I listed (they include attribution extraction from 
ChangeLog.<branch> files and all of Richard's commit message 
improvements).  The 3a and 3b repositories have further improvements to 
the conversion machinery.  Verification shows that 3a and 3b have execute 
permissions on files exactly the same as in SVN, at all (non-deleted) 
branch tips and tags.  .cvsignore files are present as requested.  There 
are mergeinfo improvements to avoid spuriously marking cherry-picks as 
merge commits and to avoid having excessive numbers of merge parents in 
some cases, but the mergeinfo improvements should be considered a work in 
progress; further improvements have since been implemented in reposurgeon 
to allow many more merges to result in git merge commits without false 
positives for cherry-picks, and these will be in the next trial conversion.

-- 
Joseph S. Myers
joseph@codesourcery.com


* Re: Test GCC conversion with reposurgeon available
  2019-12-18 13:10 ` Jason Merrill
@ 2019-12-18 18:16   ` Joseph Myers
  2019-12-19  5:50     ` Jason Merrill
  0 siblings, 1 reply; 54+ messages in thread
From: Joseph Myers @ 2019-12-18 18:16 UTC (permalink / raw)
  To: Jason Merrill; +Cc: gcc Mailing List, Eric Raymond

On Wed, 18 Dec 2019, Jason Merrill wrote:

> On Tue, Dec 17, 2019 at 4:39 PM Joseph Myers <joseph@codesourcery.com>
> wrote:
> 
> > Points for consideration:
> >
> > 1. Do we want some kind of rearrangement of refs as in the 1b
> > repository or not?
> >
> 
> Maybe?  How much space does that save in a clone?  How much work does a
> partial clone add on the server, since the server needs to pack up the
> objects for the partial clone rather than just transmitting its own packs?

I haven't measured work on the server, and timing individual clones is 
liable to a lot of variation from variable load there, but for a single 
clone --mirror of the 1b repository (so all refs, including refs/deleted/) 
I got

real    13m16.473s
user    16m45.429s
sys     0m33.901s

and 1360 MB objects directory, but for a clone without --mirror (so only a 
limited subset of refs and the server needing to build a pack)

real    15m5.554s
user    12m11.771s
sys     0m26.914s

and 950 MB objects directory.  Adding the objects from the existing 
git-svn mirror (presumably also under refs not fetched by default) 
increases repository size by about 300 MB, based on a previous test of 
doing that (most blob and tree objects will be shared between the two 
versions of the history, but all the commit objects are separate).

> > 3. Where an attribution comes from an author map rather than a
> > ChangeLog file, do we wish to use the existing author map or do people
> > prefer using names from that map but with @gcc.gnu.org addresses (and
> > @gnu.org for usernames that only committed in the gcc2 period)?
> 
>  I lean toward the latter.

I'll plan to change the author map to default to @gcc.gnu.org and @gnu.org 
addresses.

-- 
Joseph S. Myers
joseph@codesourcery.com


* Re: Test GCC conversion with reposurgeon available
  2019-12-18  3:28     ` Joseph Myers
@ 2019-12-18 14:36       ` Joseph Myers
  0 siblings, 0 replies; 54+ messages in thread
From: Joseph Myers @ 2019-12-18 14:36 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: gcc, esr

On Wed, 18 Dec 2019, Joseph Myers wrote:

> On Wed, 18 Dec 2019, Joseph Myers wrote:
> 
> > On Wed, 18 Dec 2019, Bernd Schmidt wrote:
> > 
> > > On 12/17/19 10:32 PM, Joseph Myers wrote:
> > > > git+ssh://gcc.gnu.org/home/gccadmin/gcc-reposurgeon-1a.git
> > > 
> > > It seems that permission bits are not reproduced entirely correctly. For
> > > example, contrib/check_GNU_style_lib.py went from -rwxr-xr-x in svn (and the
> > > git-svn repository) to -rw-r--r-- in this new git repository.
> > 
> > Thanks, I've reduced this to a minimal test for Eric, so hopefully it 
> > should be resolved soon.
> 
> I believe I have a fix for this; I'm now running a full GCC conversion 
> with that fix (and with some fixes related to merge commits).

The full validation of all branch tips and tags is still running, but I've 
confirmed that this conversion now has execute permissions on master 
exactly the same as in SVN trunk, that the spurious merges I previously 
saw have disappeared and that .cvsignore now appears in the history as 
requested.  I'll upload new repository conversions later today.

-- 
Joseph S. Myers
joseph@codesourcery.com


* Re: Test GCC conversion with reposurgeon available
  2019-12-17 21:32 Joseph Myers
  2019-12-17 23:33 ` Bernd Schmidt
@ 2019-12-18 13:10 ` Jason Merrill
  2019-12-18 18:16   ` Joseph Myers
  2019-12-18 21:55 ` Joseph Myers
  2 siblings, 1 reply; 54+ messages in thread
From: Jason Merrill @ 2019-12-18 13:10 UTC (permalink / raw)
  To: Joseph Myers; +Cc: gcc Mailing List, Eric Raymond

On Tue, Dec 17, 2019 at 4:39 PM Joseph Myers <joseph@codesourcery.com>
wrote:

> Points for consideration:
>
> 1. Do we want some kind of rearrangement of refs as in the 1b
> repository or not?
>

Maybe?  How much space does that save in a clone?  How much work does a
partial clone add on the server, since the server needs to pack up the
objects for the partial clone rather than just transmitting its own packs?


> 2. Should the final converted repository contain refs/deleted/ refs or
> not?
>

I think not.


> 3. Where an attribution comes from an author map rather than a
> ChangeLog file, do we wish to use the existing author map or do people
> prefer using names from that map but with @gcc.gnu.org addresses (and
> @gnu.org for usernames that only committed in the gcc2 period)?
>

 I lean toward the latter.

Jason


* Re: Test GCC conversion with reposurgeon available
  2019-12-18  0:52   ` Joseph Myers
@ 2019-12-18  3:28     ` Joseph Myers
  2019-12-18 14:36       ` Joseph Myers
  0 siblings, 1 reply; 54+ messages in thread
From: Joseph Myers @ 2019-12-18  3:28 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: gcc, esr

On Wed, 18 Dec 2019, Joseph Myers wrote:

> On Wed, 18 Dec 2019, Bernd Schmidt wrote:
> 
> > On 12/17/19 10:32 PM, Joseph Myers wrote:
> > > git+ssh://gcc.gnu.org/home/gccadmin/gcc-reposurgeon-1a.git
> > 
> > It seems that permission bits are not reproduced entirely correctly. For
> > example, contrib/check_GNU_style_lib.py went from -rwxr-xr-x in svn (and the
> > git-svn repository) to -rw-r--r-- in this new git repository.
> 
> Thanks, I've reduced this to a minimal test for Eric, so hopefully it 
> should be resolved soon.

I believe I have a fix for this; I'm now running a full GCC conversion 
with that fix (and with some fixes related to merge commits).

-- 
Joseph S. Myers
joseph@codesourcery.com

* Re: Test GCC conversion with reposurgeon available
  2019-12-17 23:33 ` Bernd Schmidt
  2019-12-18  0:51   ` Eric S. Raymond
@ 2019-12-18  0:52   ` Joseph Myers
  2019-12-18  3:28     ` Joseph Myers
  1 sibling, 1 reply; 54+ messages in thread
From: Joseph Myers @ 2019-12-18  0:52 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: gcc, esr

On Wed, 18 Dec 2019, Bernd Schmidt wrote:

> On 12/17/19 10:32 PM, Joseph Myers wrote:
> > git+ssh://gcc.gnu.org/home/gccadmin/gcc-reposurgeon-1a.git
> 
> It seems that permission bits are not reproduced entirely correctly. For
> example, contrib/check_GNU_style_lib.py went from -rwxr-xr-x in svn (and the
> git-svn repository) to -rw-r--r-- in this new git repository.

Thanks, I've reduced this to a minimal test for Eric, so hopefully it 
should be resolved soon.  I've also implemented comparison of execute 
permissions in my script checking branch tips, so future conversions will 
be fully checked in that regard.  That includes the one I'm running right 
now; since that one uses unfixed reposurgeon, the comparison will just 
show exactly where there are problems, which should help show whether 
there are any cases of permission issues different from my minimal test.  
All cases I see on trunk / master match the pattern of that minimal test.
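
As a rough illustration of the execute-permission check described above
(not the actual checking script, which is not shown in this thread), a
comparison could be driven from the output of "git ls-tree -r", where an
executable file appears with mode 100755 and a regular file with mode
100644.  The function names and the sample data are hypothetical.

```python
from typing import Dict, Set


def executable_paths(ls_tree_output: str) -> Set[str]:
    """Return the paths listed with mode 100755 in `git ls-tree -r` output.

    Each line has the form "<mode> <type> <object>\t<path>".  Paths with
    unusual characters are quoted by git unless -z is used; this sketch
    does not handle that case.
    """
    result = set()
    for line in ls_tree_output.splitlines():
        if not line.strip():
            continue
        meta, path = line.split("\t", 1)
        mode, obj_type, _object_id = meta.split()
        if obj_type == "blob" and mode == "100755":
            result.add(path)
    return result


def exec_bit_mismatches(svn_executables: Set[str],
                        ls_tree_output: str) -> Dict[str, Set[str]]:
    """Compare the executable paths expected from SVN with those in git."""
    git_executables = executable_paths(ls_tree_output)
    return {
        "lost": svn_executables - git_executables,    # executable only in SVN
        "gained": git_executables - svn_executables,  # executable only in git
    }
```

With the file reported in this thread listed as 100644 in git but
expected to be executable, it would show up in the "lost" set.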

> I vote for including .cvsignore files. Their absence makes diff comparisons of
> "git ls-tree" on specific revisions needlessly noisy.

This has been implemented, so future conversion runs will have .cvsignore 
files included in the git repository.  That does not apply to the run in 
progress right now, which just addresses issues 4 and 5 from my list and 
is based on r279452 rather than r279402, so it includes two days' more 
changes from SVN.

-- 
Joseph S. Myers
joseph@codesourcery.com

* Re: Test GCC conversion with reposurgeon available
  2019-12-17 23:33 ` Bernd Schmidt
@ 2019-12-18  0:51   ` Eric S. Raymond
  2019-12-18  0:52   ` Joseph Myers
  1 sibling, 0 replies; 54+ messages in thread
From: Eric S. Raymond @ 2019-12-18  0:51 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: Joseph Myers, gcc

Bernd Schmidt <bernds_cb1@t-online.de>:
> I vote for including .cvsignore files. Their absence makes diff comparisons
> of "git ls-tree" on specific revisions needlessly noisy.

A few minutes ago I implemented and pushed a --cvsignores read option
for Subversion dumps.  That should do what you want.
-- 
		<a href="http://www.catb.org/~esr/">Eric S. Raymond</a>


* Re: Test GCC conversion with reposurgeon available
  2019-12-17 21:32 Joseph Myers
@ 2019-12-17 23:33 ` Bernd Schmidt
  2019-12-18  0:51   ` Eric S. Raymond
  2019-12-18  0:52   ` Joseph Myers
  2019-12-18 13:10 ` Jason Merrill
  2019-12-18 21:55 ` Joseph Myers
  2 siblings, 2 replies; 54+ messages in thread
From: Bernd Schmidt @ 2019-12-17 23:33 UTC (permalink / raw)
  To: Joseph Myers, gcc; +Cc: esr

On 12/17/19 10:32 PM, Joseph Myers wrote:
> git+ssh://gcc.gnu.org/home/gccadmin/gcc-reposurgeon-1a.git

It seems that permission bits are not reproduced entirely correctly. For 
example, contrib/check_GNU_style_lib.py went from -rwxr-xr-x in svn (and 
the git-svn repository) to -rw-r--r-- in this new git repository.

I vote for including .cvsignore files. Their absence makes diff 
comparisons of "git ls-tree" on specific revisions needlessly noisy.


Bernd

* Test GCC conversion with reposurgeon available
@ 2019-12-17 21:32 Joseph Myers
  2019-12-17 23:33 ` Bernd Schmidt
                   ` (2 more replies)
  0 siblings, 3 replies; 54+ messages in thread
From: Joseph Myers @ 2019-12-17 21:32 UTC (permalink / raw)
  To: gcc; +Cc: esr

I've made test conversions of the GCC repository with reposurgeon
available.  A gcc.gnu.org / sourceware.org account is required to access
these git+ssh repositories, but it doesn't need to be in the gcc group
or to have shell access.  More information about the repositories,
the conversion choices made and known issues is given below.  As noted
there, I'm running another conversion now with fixes for some of those
issues; the remaining listed issues are being actively worked on.

git+ssh://gcc.gnu.org/home/gccadmin/gcc-reposurgeon-1a.git
git+ssh://gcc.gnu.org/home/gccadmin/gcc-reposurgeon-1b.git

The two repositories have exactly the same objects (thus, exactly the
same commit graph).  The only difference is that the 1a conversion has
branches and tags named the same as in SVN (or as similarly as
possible; tags in branches/st/tags/ in SVN become tags in
refs/tags/st/ in git), whereas the 1b conversion has refs rearranged
as suggested by Richard (meaning most are not fetched by default, so
you may wish to clone with --mirror to inspect them more closely).  We
can of course do a different rearrangement if desired.

The repositories also include refs/deleted refs for each commit that
deleted a tag or branch in SVN (to be precise, the ref points to a
commit deleting all the tag or branch contents, so preserving the
original commit message for the deletion; its parent is thus the final
state of the tag or branch before deletion).  We may or may not want
these in the final conversion, but it seems useful to have them at
this point for verification purposes (in particular, I intend to
implement a check that the final state of each tag or branch before
deletion is correct, as a further check that the conversion machinery
is working correctly).

The repositories don't include refs for the version of history from
the old git-svn mirror, but I have a script to add them (in
refs/git-old/ and refs/git-svn-old/) for the benefit of people wishing
to interpret old commit hashes after the conversion and to make things
more convenient for people wishing to rebase active git-only branches
onto the new version of the history.  The script is independent of
reposurgeon; it's just a single "git fetch" command (which should be
followed by "git gc --aggressive").

The repositories include all the non-deleted branches and tags in the
SVN repository (and, outside refs/deleted/, that is the exact set of
branches and tags present).  For this purpose, the file
branches/st/README in the SVN repository is considered to have its own
branch.  reposurgeon generates a "root" branch for commits to paths
not part of any branch; this is not included in these repositories
(has been deleted at the git level) because I don't believe it
contains anything plausibly relevant in git.  The commits to
branches/st/README were moved to their own branch, as noted; all other
commits that end up in "root" are either commits wrongly creating a
branch or tag at top level rather than /branches or /tags, commits
deleting such branches or tags created in the wrong place, or changes
to the SVN /hooks directory.

As far as I know, all issues affecting commit tree contents have been
fixed, as have some previously noted issues with some merge commits
having too many parents, and incorrect attributions seen in an earlier
conversion of Richard's.  Tree contents are verified correct at every
non-deleted branch tip and tag (I intend to do such validation for
deleted branches and tags as well, but haven't yet implemented it).
For comparisons, the following methodology applies: empty directories
are removed from the SVN checkout, because git doesn't store empty
directories; .cvsignore files are excluded from the comparison, since
reposurgeon doesn't include them (but if people want them in the git
history, they could easily be included); .gitignore files are excluded
from the comparison, since reposurgeon generates one based on
svn:ignore properties (or SVN defaults, in the absence of such
properties) where the repository doesn't have one checked in (where
there *is* a .gitignore file checked into SVN, it's preferred over the
auto-generated one); cases of SVN keyword expansion are excluded
manually (only two branches have files with SVN keyword expansion
enabled).
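
The comparison rules above could be sketched roughly as follows.  This
is a minimal illustration, not the script actually used for the checks:
the name tree_manifest and the use of SHA-256 are assumptions, and the
manual exclusion of SVN keyword expansion is not modelled.

```python
import hashlib
import os
from typing import Dict

# Files left out of the comparison, per the methodology above.
IGNORED_NAMES = {".cvsignore", ".gitignore"}
# Version-control metadata directories that are not part of the tree.
VCS_DIRS = {".git", ".svn"}


def tree_manifest(root: str) -> Dict[str, str]:
    """Map each compared file (relative path) to a digest of its contents.

    Only files are recorded, so empty directories never appear in the
    result.  Symlinks to files are recorded by target; symlinks to
    directories are not handled in this sketch.
    """
    manifest = {}
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune metadata directories in place so os.walk skips them.
        dirnames[:] = [d for d in dirnames if d not in VCS_DIRS]
        for name in filenames:
            if name in IGNORED_NAMES:
                continue
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root).replace(os.sep, "/")
            if os.path.islink(full):
                manifest[rel] = "link:" + os.readlink(full)
            else:
                with open(full, "rb") as fh:
                    manifest[rel] = hashlib.sha256(fh.read()).hexdigest()
    return manifest
```

Two checkouts of the same branch tip would then be considered to match
when tree_manifest() returns equal dictionaries for both.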

Every branch with SVN ancestry based on the first commit of /trunk has
first-parent ancestry in git going back to that commit, as expected.
This includes libstdcxx_so_7-2-branch, which was created via creating
the directory for the branch and then copying only the libstdc++-v3
subdirectory from trunk rather than directly copying the whole of
/trunk to /branches/libstdcxx_so_7-2-branch; reposurgeon has detected
that case automatically and created an appropriate parent link to the
relevant trunk commit.
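
For readers less familiar with the term, first-parent ancestry means
following only the first parent of each commit, as "git rev-list
--first-parent" does.  A toy model of the property being checked,
using a hypothetical commit graph rather than real GCC hashes:

```python
from typing import Dict, List


def first_parent_chain(parents: Dict[str, List[str]], tip: str) -> List[str]:
    """Return the commits reached from tip by following first parents only.

    `parents` maps each commit to its ordered parent list; the graph is
    assumed to be acyclic, as any git history is.
    """
    chain = [tip]
    while parents.get(chain[-1]):
        chain.append(parents[chain[-1]][0])
    return chain


def reaches_by_first_parent(parents: Dict[str, List[str]],
                            tip: str, target: str) -> bool:
    """True if target lies on the first-parent chain starting at tip."""
    return target in first_parent_chain(parents, tip)


# Hypothetical history: a branch created from trunk, with one merge whose
# second parent must not be followed.
toy_history = {
    "trunk-first": [],
    "trunk-2": ["trunk-first"],
    "branch-1": ["trunk-2"],
    "other": [],
    "branch-2": ["branch-1", "other"],
}
```

Here reaches_by_first_parent(toy_history, "branch-2", "trunk-first") is
true, while "other" is not on the chain because it is only a second
parent.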

Some parts of Richard's commit message improvements are present, but
most aren't because of an issue accessing Bugzilla (which also
affected some of the improvements not involving accessing Bugzilla, as
the script terminated early).  This should be fixed for my next
conversion run.  As discussed, Richard's improvements only add new
summary lines, with the original commit message following them.


Known issues (all either already fixed or understood and currently
being worked on):

1. Some cherry-picks are showing up as merges (this is the only issue
I could find in my checks, manual and automated, that affects the
commit graph; I couldn't find any issues affecting tree contents,
first-parent ancestry or the set of refs present).  Being worked on by
Julien "_FrnchFrgg_" RIVAUD.

2. Branch creation or recreation commits have attribution taken from
some ChangeLog file in the branch when it should come from the SVN
committer.  Being worked on by Eric S. Raymond.

3. There are still some merge commits with too many parents, although
the cases Richard found have all been fixed (and all those parents in
the cases I found are genuinely ancestors of the merge commit in
question, so it's essentially a cosmetic issue that there are some
that are redundant - it won't affect anything in default "git log"
output other than the "Merge:" line, for example, as "git log" orders
by commit timestamp by default).  Being worked on by Julien
"_FrnchFrgg_" RIVAUD.

4. Only files called ChangeLog are used to extract attributions, not
ChangeLog.<branch> (fixed for my next conversion run, currently
running at the git-fast-import stage).

5. Most of Richard's commit message improvements aren't present (fixed
for my next conversion run, currently running at the git-fast-import
stage).
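
To illustrate the redundant-parent case in issue 3 above: one plausible
reading of "redundant" (an assumption, not a definition taken from
reposurgeon) is a parent that is already reachable through another
parent of the same merge commit.  A sketch on a hypothetical graph:

```python
from typing import Dict, List, Set


def ancestors(parents: Dict[str, List[str]], commit: str) -> Set[str]:
    """Return every commit reachable from commit, excluding commit itself."""
    seen = set()
    stack = list(parents.get(commit, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def redundant_parents(parents: Dict[str, List[str]], merge: str) -> List[str]:
    """List parents of merge that are ancestors of another of its parents."""
    direct = parents[merge]
    return [
        p for p in direct
        if any(p in ancestors(parents, q) for q in direct if q != p)
    ]


# Hypothetical graph: "merge" lists "root" as a third parent even though
# "root" is already an ancestor of both other parents.
toy_graph = {
    "root": [],
    "trunk-1": ["root"],
    "branch-1": ["root"],
    "merge": ["trunk-1", "branch-1", "root"],
    "clean-merge": ["trunk-1", "branch-1"],
}
```

Dropping such a parent leaves the set of ancestors of the merge commit
unchanged, which is why the issue is described as essentially cosmetic.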


Points for consideration:

1. Do we want some kind of rearrangement of refs as in the 1b
repository or not?

2. Should the final converted repository contain refs/deleted/ refs or
not?

3. Where an attribution comes from an author map rather than a
ChangeLog file, do we wish to use the existing author map or do people
prefer using names from that map but with @gcc.gnu.org addresses (and
@gnu.org for usernames that only committed in the gcc2 period)?
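
For concreteness, the second option in point 3 amounts to something
like the following.  This is only a sketch of the proposal; the
function name and the sample user are hypothetical, and the real author
map format is not shown here.

```python
def attribution(username: str, full_name: str, gcc2_only: bool = False) -> str:
    """Build an author string from a map entry under the second option.

    Names come from the author map; the address is derived from the
    username, with @gnu.org for usernames that only committed in the
    gcc2 period and @gcc.gnu.org otherwise.
    """
    domain = "gnu.org" if gcc2_only else "gcc.gnu.org"
    return f"{full_name} <{username}@{domain}>"
```

For a made-up user, attribution("jdoe", "Jane Doe") gives
"Jane Doe <jdoe@gcc.gnu.org>".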

-- 
Joseph S. Myers
joseph@codesourcery.com

Thread overview: 54+ messages
2020-01-06 22:09 Test GCC conversion with reposurgeon available Loren James Rittle
2020-01-07  9:35 ` Richard Earnshaw (lists)
2020-01-07 15:53   ` Loren James Rittle
  -- strict thread matches above, loose matches on Subject: below --
2019-12-17 21:32 Joseph Myers
2019-12-17 23:33 ` Bernd Schmidt
2019-12-18  0:51   ` Eric S. Raymond
2019-12-18  0:52   ` Joseph Myers
2019-12-18  3:28     ` Joseph Myers
2019-12-18 14:36       ` Joseph Myers
2019-12-18 13:10 ` Jason Merrill
2019-12-18 18:16   ` Joseph Myers
2019-12-19  5:50     ` Jason Merrill
2019-12-19 15:55       ` Joseph Myers
2019-12-18 21:55 ` Joseph Myers
2019-12-19  0:36   ` Bernd Schmidt
2019-12-19  0:58     ` Joseph Myers
2019-12-19 16:29   ` Joseph Myers
2019-12-22 13:57     ` Joseph Myers
2019-12-23 17:27       ` Roman Zhuykov
2019-12-24 11:50         ` Joseph Myers
2019-12-24 15:55           ` Segher Boessenkool
2019-12-24 17:17             ` Joseph Myers
2019-12-24 18:14               ` Segher Boessenkool
2019-12-25 11:03                 ` Roman Zhuykov
2019-12-25 11:20                   ` Joseph Myers
2019-12-25 12:23                     ` Eric S. Raymond
2019-12-25 14:32                   ` Andreas Schwab
2019-12-25 14:41                     ` Joseph Myers
2019-12-25 15:10                       ` Andreas Schwab
2019-12-25 15:36                         ` Joseph Myers
2019-12-25 17:15                           ` Segher Boessenkool
2019-12-25 19:33                             ` Eric S. Raymond
2019-12-26 21:03                               ` Vincent Lefevre
2019-12-26 21:31                                 ` Eric S. Raymond
2019-12-26 22:25                                   ` Toon Moene
2019-12-26 22:32                                     ` Eric S. Raymond
2019-12-27 14:40                                       ` Segher Boessenkool
2019-12-26 22:57                                   ` Vincent Lefevre
2019-12-26 23:38                                     ` Eric S. Raymond
2019-12-25 19:40                           ` Eric S. Raymond
2019-12-27 21:29                           ` Andreas Schwab
2019-12-27 21:43                             ` Joseph Myers
2019-12-25 19:19                     ` Eric S. Raymond
2019-12-27 21:30                       ` Andreas Schwab
2019-12-28  2:43                         ` Eric S. Raymond
2019-12-27 14:37                   ` Richard Earnshaw
2019-12-24 10:57       ` Maxim Kuvyrkov
2019-12-28 16:30       ` Joseph Myers
2020-01-03 12:38         ` Joseph Myers
2020-01-06 23:58           ` Andrew Pinski
2020-01-07  0:30             ` Joseph Myers
2020-01-07  0:44             ` Richard Earnshaw
2020-01-09 12:22           ` Joseph Myers
2020-01-09 21:57             ` Joseph Myers
