* Re: on reputation and lines and putting things places (Re: gcc branches?)
@ 2002-12-08  7:13 Robert Dewar
  2002-12-08 14:18 ` source mgt. requirements solicitation Tom Lord
  0 siblings, 1 reply; 60+ messages in thread

From: Robert Dewar @ 2002-12-08 7:13 UTC (permalink / raw)
To: dewar, lord; +Cc: gcc

> That's pretty much what I'd guessed.  I'll reiterate: you go girl!
> That's cool.  I admire you.  Human scaled, competent, successful:
> neat!  Sheesh.  Are you just flipping out over my use of the word
> "dinky"?

No, it is just the entire style of your presentation.

> I've started to believe that there is no variation on advocacy that
> could possibly succeed given presumptions such as you have exhibited.
> It is interesting to try to trace those presumptions back to their
> origins (*cough*cygnus).  Yet another "bash on Tom" day, I guess.

I would tend to agree if it is you doing the advocacy.  My best
advice: find someone who knows how to approach other people
successfully.

> I don't know much at all about ACT

So I see :-)

> I'm "not talking to" ACT because, at your scale, my R&D funding needs
> are too big for you and not central enough to your mission.

Well, how do you know, given the previous quote?  In fact, CM and
revision control systems are quite critical to many of our customers.
We have several customers managing systems with tens of thousands of
files and millions of lines of code.  Remember that the niche Ada
occupies is large-scale mission-critical systems.

Perhaps you are missing an opportunity here, though I must say the
phrase "my R&D funding needs" is worryingly personal, and, as I said
earlier, if the intent of this thread was to encourage people to look
at arch, it has not worked with me.

^ permalink raw reply [flat|nested] 60+ messages in thread
* source mgt. requirements solicitation
  2002-12-08  7:13 on reputation and lines and putting things places (Re: gcc branches?) Robert Dewar
@ 2002-12-08 14:18 ` Tom Lord
  2002-12-08 14:56   ` DJ Delorie
  ` (2 more replies)
  0 siblings, 3 replies; 60+ messages in thread

From: Tom Lord @ 2002-12-08 14:18 UTC (permalink / raw)
To: dewar; +Cc: gcc

dewar:

> No, it is just the entire style of your presentation.

Ok, here's a patch:

> In fact CM and revision control systems are quite critical to
> many of our customers.  We have several customers managing
> systems with tens of thousands of files and millions of lines
> of code.
[...]
> Perhaps you are missing an opportunity here
[...]
> if the intent of this thread was to encourage people to look
> at arch, it has not worked with me.

I'm inexperienced in sales, but from what I read, the right thing here
is for me to solicit from you much more information about what you
think your (or your customers') needs are -- then, if `arch' fits, I
can state why in your terms (and if not, thank you for your time and
take my leave).  Ok?

So, I'm listening.  For both the GCC project and ACT's customers, what
do you (and others on this list) initially think is important in
source management technology -- especially, but not limited to,
revision control and adjacent tools?

I said "initially" because I'm wondering how to proceed if you list
requirements that I think are buggy in one way or another.  Is it
"good style" to point that out if it occurs?

I encourage you to spend a little time answering these questions.
There are currently three or four serious revision control projects
in the free software world (OpenCM, svn, arch, and metacvs), all in
the later stages of initial development.  A lot of people, besides
just me, can probably benefit from your (and other GCC developers')
input -- and your input can help make sure you get better tools down
the road.

I have some observations that I hope your answers might begin to
address.
These are observations of facts I think are relevant; I'm assuming
it's "good style" to stop there rather than to try to turn these into
leading questions.  These observations include (in no particular
order):

1) There are frequent reports on this list of glitches with the
   current CVS repository.

2) GCC, more than many projects, relies on a distributed testing
   effort, which mostly applies to the HEAD revision and to release
   candidates.  Most of this testing is done by hand.

3) Judging by the messages on this list, there is some tension
   between the release cycle and feature development -- some issues
   around what is merged when, and around the impact of freezes.

4) GCC, more than many projects, makes use of a formal review process
   for incoming patches.

5) Mark and the CodeSourcery crew seem to do a lot of fairly
   mechanical work by hand to operate the release cycle.

6) People often do some archaeology to understand how performance and
   quality of generated code are evolving: they work up experiments
   comparing older releases to newer, and comparing various
   combinations of patches.

7) Questions about which patches relate to which issues in the issue
   database are fairly common.

8) There have been a few controversies from GCC "customers" arising
   out of whether they can use the latest release, or whether they
   should release non-official versions.

9) Distributed testing occurs mostly on the HEAD -- which means that
   the HEAD breaks on various targets, fairly frequently.

10) The utility of the existing revision control set up to people who
    lack write access is distinctly less than the utility to people
    with write access.

11) Some efforts, such as overhauling the build process, will
    probably benefit from a switch to rev ctl. systems that support
    tree rearrangements.

12) The GCC project is heavily invested in a particular testing
    framework.

13) GCC, more than many projects, makes very heavy use of development
    on branches.
-t ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: source mgt. requirements solicitation
  2002-12-08 14:18 ` source mgt. requirements solicitation Tom Lord
@ 2002-12-08 14:56 ` DJ Delorie
  2002-12-08 15:02   ` David S. Miller
  2002-12-08 15:11   ` Bruce Stephens
  2002-12-08 16:09 ` Phil Edwards
  2002-12-08 18:32 ` Joseph S. Myers
  2 siblings, 2 replies; 60+ messages in thread

From: DJ Delorie @ 2002-12-08 14:56 UTC (permalink / raw)
To: lord; +Cc: dewar, gcc

> There are currently three or four serious revision control projects
> in the free software world (OpenCM, svn, arch, and metacvs),

You forgot to list RCS and CVS.

> 2) GCC, more than many projects, relies on a distributed
>    testing effort, which mostly applies to the HEAD revision
>    and to release candidates.  Most of this testing is done
>    by hand.

All my testing is automated.

> 3) Judging by the messages on this list, there is some tension
>    between the release cycle and feature development -- some
>    issues around what is merged when, and around the impact of
>    freezes.

I don't see how any revision management system can fix this.  This is
a people problem.

> 9) Distributed testing occurs mostly on the HEAD -- which
>    means that the HEAD breaks on various targets, fairly
>    frequently.

No, more testing on the HEAD means that the HEAD *works* more often.
The other branches are just as broken; we just don't know about it
yet.

> 10) The utility of the existing revision control set up to
>     people who lack write access is distinctly less than
>     the utility to people with write access.

This is a good thing.  We don't want them to be able to do all the
things write-access people can do.  That's the whole point.

> 11) Some efforts, such as overhauling the build process, will
>     probably benefit from a switch to rev ctl. systems that
>     support tree rearrangements.

Like CVS?  It supports trees.

^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: source mgt. requirements solicitation
  2002-12-08 14:56 ` DJ Delorie
@ 2002-12-08 15:02 ` David S. Miller
  2002-12-08 15:45   ` Bruce Stephens
  2002-12-08 15:11 ` Bruce Stephens
  1 sibling, 1 reply; 60+ messages in thread

From: David S. Miller @ 2002-12-08 15:02 UTC (permalink / raw)
To: dj; +Cc: lord, dewar, gcc

If one is going to try to promote a source management system, I'm
pretty sure performance alone would be enough to convince a lot of
people.

After using bitkeeper for just a week or two, I nearly stopped doing
much GCC development simply because CVS is such a dinosaur.  It's like
driving a Model T on a US interstate highway or the autobahn.  It's
truly that painful to use.

So if arch can provide the same kind of improvement, promote that part
of it.

^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: source mgt. requirements solicitation
  2002-12-08 15:02 ` David S. Miller
@ 2002-12-08 15:45 ` Bruce Stephens
  2002-12-08 16:52   ` David S. Miller
  0 siblings, 1 reply; 60+ messages in thread

From: Bruce Stephens @ 2002-12-08 15:45 UTC (permalink / raw)
To: gcc

"David S. Miller" <davem@redhat.com> writes:

> I think if one is going to try and promote a source management system,
> I'm pretty sure performance alone would be enough to convince a lot of
> people.
>
> After using bitkeeper for just a week or two, I nearly stopped doing
> much GCC development simply because CVS is such a dinosaur.  It's like
> driving a model-T on a US interstate highway or the autobahn.  It's
> truly that painful to use.
>
> So if arch can provide the same kind of improvement, promote that part
> of it.

I think it can't, at the moment.

However, that's an interesting point: what do you do with CVS and with
BitKeeper?  What operations are performance-critical for you?

(My intuition is that arch has concentrated on operations which are
relatively uncommon, such as branch merging and the like, relying on a
revision library for operations which seem to me more common -- like
"cvs log", "cvs diff", and the like (or rather their moral equivalents
in a configuration-based CM).  The catch is that the revision library
is expensive in disk terms -- arguably not a problem, since disk space
is cheap, but even so.  But my intuition may be wrong, so what about
CVS seems slow to you, compared with BitKeeper?)

^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: source mgt. requirements solicitation
  2002-12-08 15:45 ` Bruce Stephens
@ 2002-12-08 16:52 ` David S. Miller
  0 siblings, 0 replies; 60+ messages in thread

From: David S. Miller @ 2002-12-08 16:52 UTC (permalink / raw)
To: bruce; +Cc: gcc

   From: Bruce Stephens <bruce@cenderis.demon.co.uk>
   Date: Sun, 08 Dec 2002 23:11:14 +0000

   However, that's an interesting point: what do you do with CVS and
   with BitKeeper?  What operations are performance-critical for you?

I think CVS's weak performance points are so well understood by other
people that they can comment as well as or better than I can :-)

Operations on a branch are painful, so someone can start there :-)

^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: source mgt. requirements solicitation
  2002-12-08 14:56 ` DJ Delorie
  2002-12-08 15:02   ` David S. Miller
@ 2002-12-08 15:11 ` Bruce Stephens
  2002-12-08 16:24   ` Joseph S. Myers
  1 sibling, 1 reply; 60+ messages in thread

From: Bruce Stephens @ 2002-12-08 15:11 UTC (permalink / raw)
To: gcc

DJ Delorie <dj@redhat.com> writes:

[...]

>> 10) The utility of the existing revision control set up to
>>     people who lack write access is distinctly less than
>>     the utility to people with write access.
>
> This is a good thing.  We don't want them to be able to do all the
> things write-access people can do.  That's the whole point.

Not on the central repository, no.  But it might be that people
(people without write access to the main repository) could usefully
keep branches in their own repository (perhaps merging the patches in
at some stage).  With CVS, that's not possible, but with a distributed
CM system it would be.

>> 11) Some efforts, such as overhauling the build process, will
>>     probably benefit from a switch to rev ctl. systems that
>>     support tree rearrangements.
>
> Like CVS?  It supports trees.

It doesn't handle renaming files or directories.  There are ways to do
both, but you lose something, whatever you choose to do.

^ permalink raw reply [flat|nested] 60+ messages in thread
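[Editorial aside: Bruce's point about renames can be made concrete.  A
minimal sketch of why rename support needs file *identity* separate
from file *path* -- this is a hypothetical design, not arch's, svn's,
or any real tool's, and the file paths are invented for illustration.
CVS keys history by path, so the "cvs rm old; cvs add new" workaround
starts a fresh history; a rename-aware system keys history by a stable
id and records the path per revision:]

```python
# Hypothetical rename-aware store: history is keyed by a stable file id,
# and each revision records the path the file had at that point.

class Repo:
    def __init__(self):
        self.history = {}   # file-id -> list of (path, content) revisions

    def commit(self, file_id, path, content):
        self.history.setdefault(file_id, []).append((path, content))

    def log(self, file_id):
        """Full history of one logical file, across renames."""
        return self.history[file_id]

repo = Repo()
repo.commit("f1", "lib/parse.c", "v1")
repo.commit("f1", "lib/parse.c", "v2")
# Rename: same id, new path -- prior history is preserved.
repo.commit("f1", "src/parse.c", "v3")

assert [p for p, _ in repo.log("f1")] == [
    "lib/parse.c", "lib/parse.c", "src/parse.c"]
```

Under the CVS remove-and-add workaround, the equivalent of `log` on the
new path would show only "v3"; that is the "you lose something" Bruce
describes.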
* Re: source mgt. requirements solicitation
  2002-12-08 15:11 ` Bruce Stephens
@ 2002-12-08 16:24 ` Joseph S. Myers
  2002-12-08 16:47   ` Tom Lord
  0 siblings, 1 reply; 60+ messages in thread

From: Joseph S. Myers @ 2002-12-08 16:24 UTC (permalink / raw)
To: Bruce Stephens; +Cc: gcc

On Sun, 8 Dec 2002, Bruce Stephens wrote:

> Not on the central repository, no.  But it might be that people
> (people without write access to the main repository) could usefully
> keep branches on their own repository (perhaps merging the patches in
> at some stage).  With CVS, that's not possible, but with a distributed
> CM system it would be.

Distributed CM could be a mixed blessing.  Sometimes when people merge
development from a branch to mainline, the mainline ChangeLog just
says "See ChangeLog.foobar on foobar-branch for details."  (Though I
don't think this is a proper form of ChangeLog for such changes; the
ChangeLog should describe the changes made to mainline, following the
usual standards.)  If the branch sat on someone's machine elsewhere,
there's then a lot of potential for losing this information later if
the machine goes away, fails, etc. -- whereas the main repository is
at least rsyncable and rsynced by various people.

(Such problems could be avoided if there were a mechanism by which
branches of interest -- probably including any discussed on the list
-- could be "adopted" into the main repository, so that their history
(maintained on some other machine) is regularly made available from
the main rsyncable repository and isn't lost if the originating
machine goes away.  This applies even to branches that don't get
merged to mainline (superseded by other branches, etc.) but which are
of relevance to historical discussions on the lists.)

There is one notable problem with CVS's handling of users without
write access: they can't do "cvs add" to generate diffs with added
files, though they can fake its local effects.  I don't know whether
svn fixes this.

-- 
Joseph S. Myers
jsm28@cam.ac.uk

^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: source mgt. requirements solicitation
  2002-12-08 16:24 ` Joseph S. Myers
@ 2002-12-08 16:47 ` Tom Lord
  2002-12-08 22:20   ` Craig Rodrigues
  0 siblings, 1 reply; 60+ messages in thread

From: Tom Lord @ 2002-12-08 16:47 UTC (permalink / raw)
To: gcc

Thanks for the replies so far.  These are helpful.

My intention is to read these over, take lots of notes, and make a
succinct-as-possible, coalesced reply.  I'll also (so far) reply
individually to shebs' issue with merging (since it sounds like an
interesting and relevant technical problem).  If there's some other
issue you'd like to see pulled out from a coalesced reply, please say
so explicitly.

One quick request: someone said "Hey, testing is already automated."
Can I please see a slight elaboration on the form and function of that
automation?

(I have some idea, but maybe there's something I've overlooked.  What
I _think_ I know already is that `make test' works, and that there's
some infrastructure for mailing in `make test' output and having it
show up on a web site.  Presumably individual testers have their own
scripts for that.  I'm not aware that there is any infrastructure for
easily testing arbitrary combinations of patches, but one comment
implied that there is.  Someone mentioned QMtest, which can really
tighten up that automation -- but last I heard, prospects for its
adoption by GCC were slim: has that changed?)

Still listening,
-t

^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: source mgt. requirements solicitation
  2002-12-08 16:47 ` Tom Lord
@ 2002-12-08 22:20 ` Craig Rodrigues
  0 siblings, 0 replies; 60+ messages in thread

From: Craig Rodrigues @ 2002-12-08 22:20 UTC (permalink / raw)
To: Tom Lord; +Cc: gcc

On Sun, Dec 08, 2002 at 04:22:22PM -0800, Tom Lord wrote:
> One quick request: someone said "Hey, testing is already automated."
> Can I please see a slight elaboration on the form and function of
> that automation?

As far as I can tell, there are a number of people who run daily (or
frequent) builds of GCC on a few platforms.  They use the output of
"make test", which kicks off some tests which use the DejaGNU testing
framework, and post their output to the gcc-testresults mailing list:
http://gcc.gnu.org/ml/gcc-testresults/

CodeSourcery has been working on converting the GCC testsuite over to
QMTest: http://gcc.gnu.org/ml/gcc/2002-05/msg01978.html

While the existing GCC testing process has its benefits, I don't think
it is perfect.  It would be great if someone had some positive ideas
towards improving the GCC testing process.

To give you some ideas of some of the problems in the current process,
Wolfgang Bangerth has informed me that he has identified 71 current
C++ regressions from GCC 3.2 in the mainline branch of GCC, based on
reading reports in GNATS.  Granted, many of these regressions might be
related and duplicates, but still, that is quite a number of
regressions to track down and fix.

I'd be very interested in any ideas which could improve this process.

-- 
Craig Rodrigues
http://www.gis.net/~craigr
rodrigc@attbi.com

^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: source mgt. requirements solicitation
  2002-12-08 14:18 ` source mgt. requirements solicitation Tom Lord
  2002-12-08 14:56   ` DJ Delorie
@ 2002-12-08 16:09 ` Phil Edwards
  2002-12-08 19:13   ` Zack Weinberg
  2002-12-09 15:10   ` Walter Landry
  2002-12-08 18:32 ` Joseph S. Myers
  2 siblings, 2 replies; 60+ messages in thread

From: Phil Edwards @ 2002-12-08 16:09 UTC (permalink / raw)
To: Tom Lord; +Cc: dewar, gcc

On Sun, Dec 08, 2002 at 02:06:31PM -0800, Tom Lord wrote:
> I said "initially" because I'm wondering how to proceed if you list
> requirements that I think are buggy in one way or another.  Is it
> "good style" to point that out if it occurs?

It's more likely that they understand the requirements better than you
do, so it would be /better/ style if you said, "could you elaborate on
this, here are my questions," rather than, "no, /your/ requirements
are buggy."

> 1) There are frequent reports on this list of glitches with
>    the current CVS repository.

IIRC, these have all been caused by non-CVS problems.  (E.g., disks
filled up, mail server getting hammered and DoS'ing the other
services, etc.)

> 2) GCC, more than many projects, relies on a distributed
>    testing effort, which mostly applies to the HEAD revision
>    and to release candidates.  Most of this testing is done
>    by hand.

I'll borrow one of your choice phrases and call this a bullshit rumor.
It's nearly all automated.

> 3) Judging by the messages on this list, there is some tension
>    between the release cycle and feature development -- some
>    issues around what is merged when, and around the impact of
>    freezes.

Yes.  I don't see how the choice of revision control software makes a
difference here.  The limiting resource here is people-hours.

> 4) GCC, more than many projects, makes use of a formal review
>    process for incoming patches.

Yes.

> 5) Mark and the CodeSourcery crew seem to do a lot of fairly
>    mechanical work by hand to operate the release cycle.

Perhaps you haven't looked at contrib/* and maintainer-scripts/*
lately?  Releases and weekly snapshots are all done with those.

> 6) People often do some archaeology to understand how
>    performance and quality of generated code are evolving:
>    they work up experiments comparing older releases to newer,
>    and comparing various combinations of patches.

Yes.  This is also automated, e.g., Diego's SPEC2000 pages.

> 7) Questions about which patches relate to which issues in the
>    issue database are fairly common.

*shrug*  When a patch is committed with a PR number in the log, the
issue database takes notice of it.  That's something that we added
with a CVS plugin.

> 8) There have been a few controversies from GCC "customers"
>    arising out of whether they can use the latest release, or
>    whether they should release non-official versions.

Yes.  What does this have to do with revision control software?
Anybody using open source can make this same decision.

> 9) Distributed testing occurs mostly on the HEAD -- which
>    means that the HEAD breaks on various targets, fairly
>    frequently.

Uh, no.  Exactly backwards.

> 10) The utility of the existing revision control set up to
>     people who lack write access is distinctly less than
>     the utility to people with write access.

Well, duh.

> 11) Some efforts, such as overhauling the build process, will
>     probably benefit from a switch to rev ctl. systems that
>     support tree rearrangements.

Probably.

> 12) The GCC project is heavily invested in a particular
>     testing framework.

Yes.  Well, that plus the new QMtest, which looks to be far superior.

> 13) GCC, more than many projects, makes very heavy use of
>     development on branches.

Yes.

-- 
I would therefore like to posit that computing's central challenge,
viz. "How not to make a mess of it," has /not/ been met.
                                      - Edsger Dijkstra, 1930-2002

^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: source mgt. requirements solicitation
  2002-12-08 16:09 ` Phil Edwards
@ 2002-12-08 19:13 ` Zack Weinberg
  2002-12-09 10:33   ` Phil Edwards
  2002-12-09 11:06   ` Joseph S. Myers
  1 sibling, 2 replies; 60+ messages in thread

From: Zack Weinberg @ 2002-12-08 19:13 UTC (permalink / raw)
To: Phil Edwards; +Cc: Tom Lord, dewar, gcc

Phil Edwards <phil@jaj.com> writes:

> On Sun, Dec 08, 2002 at 02:06:31PM -0800, Tom Lord wrote:
>> 1) There are frequent reports on this list of glitches with
>>    the current CVS repository.
>
> IIRC, these have all been caused by non-CVS problems.  (E.g., disks
> filled up, mail server getting hammered and DoS'ing the other
> services, etc.)

There is one situation that used to come up a lot which is CVS's
fault: a 'cvs server' process dies without removing its lock files,
wedging that directory for everyone else until the lock is manually
removed.  I believe this has been dealt with by some patches to the
server plus a cron job that looks for stale locks; however, a version
control system that could not get into a wedged state like that would
be useful.

>> 3) Judging by the messages on this list, there is some tension
>>    between the release cycle and feature development -- some
>>    issues around what is merged when, and around the impact of
>>    freezes.
>
> Yes.  I don't see how the choice of revision control software makes a
> difference here.  The limiting resource here is people-hours.

CVS makes working on branches quite difficult.  I suspect that a
system that made it easier would mean that people were a bit more
comfortable about doing development on branches for long periods of
time.

>> 4) GCC, more than many projects, makes use of a formal review
>>    process for incoming patches.
>
> Yes.

This is a strength, but with a downside -- patches can and do get
lost.  We advise people to re-send patches at intervals, but some sort
of automated patch-tracker would probably be helpful.  I don't think
the version control system can help much here (but see below).

>> 5) Mark and the CodeSourcery crew seem to do a lot of fairly
>>    mechanical work by hand to operate the release cycle.
>
> Perhaps you haven't looked at contrib/* and maintainer-scripts/*
> lately?  Releases and weekly snapshots are all done with those.

I do a fair amount of by-hand work merging the trunk into the
basic-improvements-branch.  Some, but not all, of that work could be
facilitated with a better version control system.  See below.

>> 11) Some efforts, such as overhauling the build process, will
>>     probably benefit from a switch to rev ctl. systems that
>>     support tree rearrangements.
>
> Probably.

I have several changes in mind which I have not done largely because
CVS lacks the ability to version renames.  To be specific: move cpplib
to the top level; move gcc/intl to the top level and sync it with the
version of that directory in the src repository; move the C front end
to a language subdirectory like the others; move the Ada runtime
library to the top level.  I'm not saying that I would definitely have
done all of these changes by now if we were using a version control
system that handled renames; only that the lack of rename support is a
major barrier to them.

* * *

I'm now going to list the requirements which I would place on a
replacement for CVS, in rough decreasing order of importance.  I
haven't done any research to back them up -- this is just off the top
of my head (but having thought about the issue quite a bit).

0. Must be at least as reliable and at least as portable as CVS.  GCC
   is a very large development effort.  We can't afford to lose
   contributors because their preferred platform is shut out, nor can
   we afford to lose work due to bugs, and we *especially* cannot risk
   a system which has not been audited for security exposures.

   It would be relatively easy to give much stronger data integrity
   guarantees than CVS currently manages:

   0a. All data stored in the repository is under an end-to-end
       checksum.  All data transmitted over the network is
       independently checksummed (yes, redundant with TCP-layer
       checksums).  CVS does no checksumming at all.

   0b. Anonymous repository access is done under a user ID that has
       only OS-level read privileges on the repository's files.  This
       cannot be done with (unpatched) CVS.

   0c. Remote write operations on the repository intrinsically
       require the use of a protocol which makes strong cryptographic
       integrity and authority guarantees.  CVS can be set up like
       this, but it's not built into the design.

   0d. The data stored in the repository cannot be modified by
       unprivileged local users except by going through the version
       control system.  Presently I could take 'vi' to one of the ,v
       files in /cvs/gcc and break it thoroughly, or sneak something
       into the file content, and leave no trace.

1. Must be at least as fast as CVS for all operations, and should be
   substantially faster for all operations where CVS uses a braindead
   algorithm.  I would venture to guess that everyone's #1 complaint
   about CVS is the amount of time we waste waiting for it to
   complete this or that request.  To be more specific:

   1a. Efficient network protocol.  Specifically, a network protocol
       that, for *all* operations, transmits a volume of data
       proportional -- with a small constant! -- to the size of the
       diff involved, *not* the total size of all the files touched
       by the diff involved, as CVS does.

   1b. Efficient tags and branches.  It should be possible to create
       either by creating *one* metadata record, rather than touching
       every single file in the repository.

   1c. Efficient delta storage algorithm, such that checking in a
       change on the tip of a branch is not orders of magnitude
       slower than checking in a change on the tip of the trunk.
       There are several sane ways to do this.

   1d. Efficient method for extracting a logical change after the
       fact, no matter how many files it touched.  (Currently the
       easiest way to do this is: hunt through the gcc-cvs archive
       until you find the message describing the checkin you care
       about, then use wget on all of the per-file diff URLs in the
       list and glue them all together.  Slow, painful, doesn't
       always work.)

2. Should support this laundry list of features, none of which is
   known to CVS.  Most of them would be useful independent of the
   others, though there's not much point to 2b without 2a, nor 2e
   without 2d.

   2a. Atomic application of a logical change that touches many
       files, possibly not all in the same directory.  (This is
       commonly known as a "change set".)  One checkin log per change
       set is adequate.

   2b. Ability to back out an entire change set just as atomically as
       it went in.

   2c. Ability to rename a file, including the ability for a file to
       have different names on different branches.

   2d. Automatically remember that a merge occurred from branch A to
       branch B; later, when a second merge occurs from A to B, don't
       apply those changes again.

   2e. Understand the notion of a single-delta merge, either applying
       just one change from branch A to branch B, or removing just
       one change formerly on branch A ("subtractive merge").

   2f. Perform conflict resolution by automatic formation of
       microbranches.

3. Should allow a user without commit privileges to generate a change
   set, making arbitrary changes to the repository (none of this "you
   can edit files and generate diffs but you can't add or delete
   files" nonsense), which can be applied by a user who does have
   commit privileges, and when the original author does an update
   he/she doesn't get spurious conflicts.

4. The repository's on-disk data should be stored in a highly compact
   format, to the maximum extent possible and consonant with being
   fast.  Being fast is much more important; however, GCC's CVS
   repository is ~800MB in size and compresses down to ~100MB.  You
   can do interesting things (like keep a copy of the entire
   repository on every developer's personal hard disk, as Bitkeeper
   does) with a 100MB repository that are not so practical when it's
   closer to a gigabyte.

5. Should have the ability to generate ChangeLog files automagically
   from the checkin comments.  (When merging to basic-improvements I
   normally spend more time fixing up the ChangeLogs than anything
   else.  Except maybe waiting for 'cvs tag' and 'cvs update -j...'.)

zw

^ permalink raw reply [flat|nested] 60+ messages in thread
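[Editorial aside: items 2a/2b above can be illustrated with a small
sketch.  This is a hypothetical, toy design -- not svn's or arch's
actual implementation, and the file names are invented -- showing the
two properties Zack asks for: a change set touching many files applies
all-or-nothing, and the most recent one can be backed out as a unit:]

```python
# Toy working tree with atomic change sets.  A change set maps
# path -> (old_content, new_content); None means absent (add/remove).

class Tree:
    def __init__(self):
        self.files = {}
        self.changesets = []        # applied change sets, newest last

    def apply(self, changeset):
        # Validate everything first, so a conflict anywhere leaves the
        # tree completely untouched (atomicity, item 2a).
        for path, (old, _new) in changeset.items():
            if self.files.get(path) != old:
                raise ValueError("conflict at %s" % path)
        for path, (_old, new) in changeset.items():
            if new is None:
                del self.files[path]    # file removed by this change set
            else:
                self.files[path] = new
        self.changesets.append(changeset)

    def backout(self):
        # Reverse the most recent change set as a unit (item 2b).
        changeset = self.changesets.pop()
        for path, (old, _new) in changeset.items():
            if old is None:
                del self.files[path]    # file was added; remove it again
            else:
                self.files[path] = old

tree = Tree()
tree.apply({"gcc/c-typeck.c": (None, "v1"), "gcc/ChangeLog": (None, "entry1")})
tree.apply({"gcc/c-typeck.c": ("v1", "v2")})
tree.backout()
assert tree.files["gcc/c-typeck.c"] == "v1"
```

With per-file checkins, as in CVS, the equivalent of `backout` must be
reconstructed file by file from mailing-list archives, which is exactly
the pain described in 1d.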
* Re: source mgt. requirements solicitation
  2002-12-08 19:13 ` Zack Weinberg
@ 2002-12-09 10:33 ` Phil Edwards
  2002-12-09 11:06 ` Joseph S. Myers
  1 sibling, 0 replies; 60+ messages in thread

From: Phil Edwards @ 2002-12-09 10:33 UTC (permalink / raw)
To: Zack Weinberg; +Cc: Tom Lord, dewar, gcc

On Sun, Dec 08, 2002 at 04:55:14PM -0800, Zack Weinberg wrote:
> I'm now going to list the requirements which I would place on a
> replacement for CVS, in rough decreasing order of importance.  I
> haven't done any research to back them up -- this is just off the top
> of my head (but having thought about the issue quite a bit).

With the exception of 2[def] and 5, I believe subversion does all of
those.  I handle 5 with a wrapper script for checkins.

Phil

-- 
I would therefore like to posit that computing's central challenge,
viz. "How not to make a mess of it," has /not/ been met.
                                      - Edsger Dijkstra, 1930-2002

^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: source mgt. requirements solicitation
  2002-12-08 19:13 ` Zack Weinberg
  2002-12-09 10:33   ` Phil Edwards
@ 2002-12-09 11:06 ` Joseph S. Myers
  2002-12-09  9:42   ` Zack Weinberg
  1 sibling, 1 reply; 60+ messages in thread

From: Joseph S. Myers @ 2002-12-09 11:06 UTC (permalink / raw)
To: Zack Weinberg; +Cc: gcc

On Sun, 8 Dec 2002, Zack Weinberg wrote:

> 0a. All data stored in the repository is under an end-to-end
>     checksum.  All data transmitted over the network is independently
>     checksummed (yes, redundant with TCP-layer checksums).  CVS does
>     no checksumming at all.

Doesn't SSH?

(And CVS does checksum checkouts/updates: if, after applying a diff in
cvs update, the file checksum doesn't match, it warns and re-gets the
whole file, which can indicate something was broken in the latest
checkin to the file (yielding a bogus delta).  This is, however,
highly suboptimal -- it should be an error, not a warning (with a
warning sent to the repository maintainers), and lots more
checksumming should be done.  In addition:

0aa. Checksums stored in the repository format for all file
     revisions, deltas, log messages, etc., with an easy way to
     verify them -- to detect corruption early.)

> 5. Should have the ability to generate ChangeLog files automagically
>    from the checkin comments.  (When merging to basic-improvements I
>    normally spend more time fixing up the ChangeLogs than anything
>    else.  Except maybe waiting for 'cvs tag' and 'cvs update -j...'.)

The normal current practice here is for branch ChangeLogs to be kept
in a separate file, not the ChangeLogs that need merging from
mainline.  (In the case of BIB the branch ChangeLog then goes on the
top of the mainline one (with an overall "merge from BIB" comment)
when the merge back to mainline is done.  For branches developing new
features, a new ChangeLog entry describing the overall logical effect
of the branch changes, not the details of how that state was reached,
is more appropriate.)

-- 
Joseph S. Myers
jsm28@cam.ac.uk

^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: source mgt. requirements solicitation 2002-12-09 11:06 ` Joseph S. Myers @ 2002-12-09 9:42 ` Zack Weinberg 2002-12-09 11:00 ` Jack Lloyd 0 siblings, 1 reply; 60+ messages in thread From: Zack Weinberg @ 2002-12-09 9:42 UTC (permalink / raw) To: Joseph S. Myers; +Cc: gcc "Joseph S. Myers" <jsm28@cam.ac.uk> writes: > On Sun, 8 Dec 2002, Zack Weinberg wrote: > >> 0a. All data stored in the repository is under an end-to-end >> checksum. All data transmitted over the network is independently >> checksummed (yes, redundant with TCP-layer checksums). CVS does >> no checksumming at all. > > Doesn't SSH? I assume it has to, since cryptography usually requires that. > (And CVS does checksum checkouts/updates: if after applying a diff in cvs > update the file checksum doesn't match, it warns and regets the whole > file, which can indicate something was broken in the latest checkin to the > file (yielding a bogus delta). I didn't know that. But, as you say, it's not nearly enough. (When was the last time we got a block of binary zeroes in a ,v file and nobody noticed for months?) > 0aa. Checksums stored in the repository format for all file > revisions, deltas, log messages etc., with an easy way to verify > them - to detect corruption early.) Worth pointing out that subversion doesn't do as much checksumming as we'd like, either. > The normal current practice here is for branch ChangeLogs to be kept > in a separate file, not the ChangeLogs that need merging from > mainline. (In the case of BIB the branch ChangeLog then goes on the > top of the mainline one (with an overall "merge from BIB" comment) > when the merge back to mainline is done. For branches developing > new features a new ChangeLog entry describing the overall logical > effect of the branch changes, not the details of how that state was > reached, is more appropriate.)
Unfortunately, this is not how BIB was done, and I'm stuck with the way it is being done now (the normal ChangeLog files are used, and I resolve the conflict on every merge). Next time around, it would certainly be easier to use a separate file -- but better still to avoid maintaining the files at all. zw ^ permalink raw reply [flat|nested] 60+ messages in thread
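Requirement 5, generating ChangeLog files automatically from checkin comments, is mechanical once commit metadata is structured. A minimal sketch, assuming a hypothetical commit-record schema (the dict keys and the example email address are illustrative, not the format of CVS, svn, or arch):

```python
def format_changelog(commits):
    """Render GNU-style ChangeLog entries from commit records.

    Each commit is a dict with 'date' (YYYY-MM-DD), 'author',
    'email', and 'message' keys -- a hypothetical schema chosen
    for this sketch, not any real tool's storage format.
    """
    entries = []
    for c in commits:
        header = "%s  %s  <%s>" % (c["date"], c["author"], c["email"])
        # GNU ChangeLog convention: entry body is tab-indented.
        body = "\n".join(
            "\t" + line if line else ""
            for line in c["message"].splitlines()
        )
        entries.append(header + "\n\n" + body + "\n")
    return "\n".join(entries)

log = format_changelog([
    {"date": "2002-12-09", "author": "Zack Weinberg",
     "email": "zack@example.net",   # illustrative address
     "message": "* gcc.c (main): Fix a leak."},
])
print(log)
```

With records like these kept per changeset, the per-file ChangeLog conflicts described above disappear: the file is a derived artifact, regenerated rather than merged.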
* Re: source mgt. requirements solicitation 2002-12-09 9:42 ` Zack Weinberg @ 2002-12-09 11:00 ` Jack Lloyd 0 siblings, 0 replies; 60+ messages in thread From: Jack Lloyd @ 2002-12-09 11:00 UTC (permalink / raw) To: Zack Weinberg; +Cc: gcc On Mon, 9 Dec 2002, Zack Weinberg wrote: > > 0aa. Checksums stored in the repository format for all file > > revisions, deltas, log messages etc., with an easy way to verify > > them - to detect corruption early.) > > Worth pointing out that subversion doesn't do as much checksumming as > we'd like, either. OpenCM (opencm.org) does really good checksumming; everything is based off of strong hashes (and RSA signatures where needed). In particular, your 0d requirement is met in a way that no other CM system (that I've heard about) can do. Nobody (even root) can substitute one file for another or similar nastiness. Well, unless they can break SHA-1 in a really serious way. I'll mention that I work on OpenCM (it's my day job), and additionally I promise I won't go endlessly promoting it on the list. I'd be happy to answer any questions off list if you like, but this is the first and last time I'll bring it up here. -Jack ^ permalink raw reply [flat|nested] 60+ messages in thread
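The guarantee Jack describes, that nobody (even root) can silently substitute one file for another, falls out of naming every stored object by a strong hash of its contents: tampering changes the data but not the name, so the mismatch is caught on read. A toy sketch of the idea using SHA-1 (an illustration of the principle, not OpenCM's actual repository format):

```python
import hashlib

class ContentStore:
    """Toy content-addressed store: an object's identifier *is* the
    SHA-1 of its bytes, so any in-place modification is detected the
    next time the object is read."""

    def __init__(self):
        self._objects = {}

    def put(self, data: bytes) -> str:
        oid = hashlib.sha1(data).hexdigest()
        self._objects[oid] = data
        return oid

    def get(self, oid: str) -> bytes:
        data = self._objects[oid]
        # Re-verify on every read: name and content must still agree.
        if hashlib.sha1(data).hexdigest() != oid:
            raise ValueError("object %s corrupted or tampered with" % oid)
        return data

store = ContentStore()
oid = store.put(b"int main(void) { return 0; }\n")

# Simulate an attacker (or bit rot) editing the stored bytes directly:
store._objects[oid] = b"int main(void) { return 1; }\n"
try:
    store.get(oid)
except ValueError:
    print("tampering detected")
```

Defeating this requires producing a different file with the same hash, i.e. breaking the hash function, which is exactly the caveat in the message above.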
* Re: source mgt. requirements solicitation 2002-12-08 16:09 ` Phil Edwards 2002-12-08 19:13 ` Zack Weinberg @ 2002-12-09 15:10 ` Walter Landry 2002-12-09 15:27 ` Joseph S. Myers 1 sibling, 1 reply; 60+ messages in thread From: Walter Landry @ 2002-12-09 15:10 UTC (permalink / raw) To: gcc Hi, > I'm now going to list the requirements which I would place on a > replacement for CVS, in rough decreasing order of importance. I > haven't done any research to back them up -- this is just off the top > of my head (but having thought about the issue quite a bit). > > 0. Must be at least as reliable To my knowledge, arch doesn't have any real reliability problems. Tom didn't make a fast implementation, but it is reliable. There is a bug database [1] with all of the bugs that I can think of, so you can decide for yourself whether it is reliable. > and at least as portable as CVS. Arch currently doesn't work on 64 bit machines. The problem is in the non-shell parts. It is not an insurmountable problem, it is just that no one has taken the time to try to fix the bugs. Someone got it running under cygwin once, but the patches have disappeared. It wasn't usable there. Too slow. Otherwise, it seems to work on posix machines. > GCC is a very large development effort. We can't afford to lose > contributors because their preferred platform is shut out, nor > can we afford to lose work due to bugs, and we *especially* > cannot risk a system which has not been audited for security > exposures. It would be relatively easy to give much stronger > data integrity guarantees than CVS currently manages: arch doesn't interact at all with root. The remote repositories are all done with sftp, ftp, and http, which is as secure as those servers are. > 0a. All data stored in the repository is under an end-to-end > checksum. All data transmitted over the network is independently > checksummed (yes, redundant with TCP-layer checksums). CVS does > no checksumming at all. Sort of. 
Patches are gzipped, and gzip has its own checksum, but there isn't any way to make sure that what you get is the same thing as what you put in. That is, there are some individual checksums, but no end-to-end checksum. > 0b. Anonymous repository access is done under a user ID that has only > OS-level read privileges on the repository's files. This cannot > be done with (unpatched) CVS. Is http access good enough? > 0c. Remote write operations on the repository intrinsically require > the use of a protocol which makes strong cryptographic integrity > and authority guarantees. CVS can be set up like this, but it's > not built into the design. Currently, we allow writeable ftp servers and sftp servers. If we disallowed writeable ftp servers, would that be good enough? (Don't tempt me. I've considered it in the past.) > 0d. The data stored in the repository cannot be modified by > unprivileged local users except by going through the version > control system. Presently I could take 'vi' to one of the ,v > files in /cvs/gcc and break it thoroughly, or sneak something into > the file content, and leave no trace. There is no interaction with root, so if you own the archive, you can always do what you want. To get anything approaching this, you have to deal with PGP signatures, SHA hashes, and the like. OpenCM is probably the only group (including BitKeeper) that even comes close to doing this right. > 1. Must be at least as fast as CVS for all operations, and should be > substantially faster for all operations where CVS uses a braindead > algorithm. I would venture to guess that everyone's #1 complaint > about CVS is the amount of time we waste waiting for it to complete > this or that request. To be more specific: Arch is slow, slow, slow. Don't let Tom beguile you into thinking that it is even reasonably fast right now. It isn't. It is a subject of great interest to the developers, but we're not there yet. Part of this is the shell implementation.
Once certain parts are rewritten in a compiled language, it should get _much_ better. > 1a. Efficient network protocol. Specifically, a network protocol that, > for *all* operations, transmits a volume of data proportional -- > with a small constant! -- to the size of the diff involved, *not* > the total size of all the files touched by the diff involved, as > CVS does. Arch has this, although some of the implementations could do with a little improvement (e.g. the mirroring script seems to take forever). > 1b. Efficient tags and branches. It should be possible to create > either by creating *one* metadata record, rather than touching > every single file in the repository. Don't know. I haven't looked at the actual implementation. There isn't a fundamental reason why not, though. > 1c. Efficient delta storage algorithm, such that checking in a change > on the tip of a branch is not orders of magnitude slower than > checking in a change on the tip of the trunk. There are several > sane ways to do this. Arch has this > 1d. Efficient method for extracting a logical change after the fact, > no matter how many files it touched. (Currently the easiest way > to do this is: hunt through the gcc-cvs archive until you find the > message describing the checkin you care about, then use wget on > all of the per-file diff URLs in the list and glue them all > together. Slow, painful, doesn't always work.) Arch has this > 2. Should support this laundry list of features, none of which is > known to CVS. Most of them would be useful independent of the > others, though there's not much point to 2b without 2a, nor 2e > without 2d. > > 2a. Atomic application of a logical change that touches many files, > possibly not all in the same directory. (This is commonly known as > a "change set".) One checkin log per change set is adequate. Arch has this. It's why I started using it. > 2b. Ability to back out an entire change set just as atomically as it > went in. 
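The change-set semantics of requirement 2a amount to a transaction over several files: either every edit (including adds and deletes) lands, or none does. A minimal in-memory sketch of that all-or-nothing behavior (a model of the semantics only, not CVS's, arch's, or svn's API):

```python
import copy

def apply_changeset(tree, changes):
    """Apply a multi-file change set atomically.

    'tree' maps path -> content; 'changes' maps
    path -> (expected_old, new), where expected_old is None for a
    new file and new is None for a deletion. All edits are staged
    on a copy and committed in a single swap, so a conflict on any
    file leaves the whole tree untouched.
    """
    staged = copy.deepcopy(tree)
    for path, (old, new) in changes.items():
        if staged.get(path) != old:
            raise ValueError("conflict on %s; nothing applied" % path)
        if new is None:
            del staged[path]      # deletion is part of the change set
        else:
            staged[path] = new
    tree.clear()
    tree.update(staged)           # commit point: single atomic swap

tree = {"a.c": "old a", "b.c": "old b"}
apply_changeset(tree, {"a.c": ("old a", "new a"),
                       "b.c": ("old b", None)})
print(sorted(tree.items()))   # [('a.c', 'new a')]
```

One checkin log then naturally attaches to the change set as a whole rather than to each file, which is what makes 2b (backing the whole thing out) and 1d (extracting a logical change) tractable.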
In theory, easy to do (just a few rm's and an mv). There are larger policy questions, though (Do we want to allow that?). Some day, I may just hack something together that does that. > 2c. Ability to rename a file, including the ability for a file to have > different names on different branches. Arch has this > 2d. Automatically remember that a merge occurred from branch A to > branch B; later, when a second merge occurs from A to B, don't > apply those changes again. Arch has this > 2e. Understand the notion of a single-delta merge, either applying > just one change from branch A to branch B, or removing just one > change formerly on branch A ("subtractive merge"). Single delta forward merges are no problem. Reverse merges are more difficult. This is one of those "lurking design issues" that I mentioned earlier. > 2f. Perform conflict resolution by automatic formation of > microbranches. I'm not quite sure what you mean here. > 3. Should allow a user without commit privileges to generate a change > set, making arbitrary changes to the repository (none of this "you > can edit files and generate diffs but you can't add or delete > files" nonsense), which can be applied by a user who does have > commit privileges, and when the original author does an update > he/she doesn't get spurious conflicts. Are you thinking of sending patches by email? Arch doesn't have that. > 4. The repository's on-disk data should be stored in a highly compact > format, to the maximum extent possible and consonant with being > fast. Being fast is much more important; however, GCC's CVS > repository is ~800MB in size and compresses down to ~100MB. You > can do interesting things (like keep a copy of the entire > repository on every developer's personal hard disk, as Bitkeeper > does) with a 100MB repository that are not so practical when it's > closer to a gigabyte. Arch stores the repository as tar.gz of the initial revision, plus tar.gz of the patches. 
This will be about as compact as anything. The problem comes when you want to get older revisions. If you're at patch-51, getting patch-48 means starting from patch-0 and applying all 48 patches. This can be sped up by saving entire trees along the way, but that kills the "highly compact format". > 5. Should have the ability to generate ChangeLog files automagically > from the checkin comments. (When merging to basic-improvements I > normally spend more time fixing up the ChangeLogs than anything > else. Except maybe waiting for 'cvs tag' and 'cvs update -j...'.) That apparently works, although I've never used it. By the way, I thought that your comments were quite illuminating, so I put them up on the arch web site [2]. I also think that Tom should stop telling everyone to work on arch. At this point, it just causes more trouble than any help I'll get. Regards, Walter Landry wlandry@ucsd.edu [1] http://bugs.fifthvision.net:8080/ [2] http://www.fifthvision.net/open/bin/view/Arch/GccHackers ^ permalink raw reply [flat|nested] 60+ messages in thread
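Walter's point about retrieving old revisions can be made concrete as a cost model: with only patch-0 stored in full, reconstructing revision N takes N patch applications, while saving a full tree every k revisions caps the walk at k patches, at the cost of the extra snapshots. A sketch of that trade-off (a cost model, not arch code):

```python
def applications_needed(rev, checkpoint_every=None):
    """Patches that must be applied to reconstruct revision 'rev'.

    With no checkpoints (arch's basic layout: tar.gz of revision 0
    plus one patch per revision), you walk forward from patch-0.
    With a full tree saved every 'checkpoint_every' revisions, you
    walk forward from the nearest earlier snapshot instead.
    """
    if checkpoint_every is None:
        return rev
    base = (rev // checkpoint_every) * checkpoint_every
    return rev - base

# Getting patch-48 from a plain archive vs. one with snapshots
# every 10 revisions:
print(applications_needed(48))                       # 48
print(applications_needed(48, checkpoint_every=10))  # 8
```

This is exactly the tension in the paragraph above: checkpoints bound retrieval time, but each one is a full tree, so they chip away at the "highly compact format" of requirement 4.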
* Re: source mgt. requirements solicitation 2002-12-09 15:10 ` Walter Landry @ 2002-12-09 15:27 ` Joseph S. Myers 2002-12-09 17:05 ` Walter Landry 2002-12-11 1:11 ` Branko Čibej 0 siblings, 2 replies; 60+ messages in thread From: Joseph S. Myers @ 2002-12-09 15:27 UTC (permalink / raw) To: Walter Landry; +Cc: gcc On Mon, 9 Dec 2002, Walter Landry wrote: > arch doesn't interact at all with root. The remote repositories are > all done with sftp, ftp, and http, which is as secure as those servers > are. Is this - for anonymous access - _plain_ HTTP, or HTTP + WebDAV + DeltaV which svn uses? One problem there was with SVN - it may have been fixed by now, and a fix would be necessary for it to be usable for GCC - was its use of HTTP and HTTPS (for write access); these tend to be heavily controlled by firewalls and the ability to tunnel over SSH (with just that one port needing to be open) would be necessary. "Transparent" proxies may pass plain HTTP OK, but not the WebDAV/DeltaV extensions SVN needs. > > 0d. The data stored in the repository cannot be modified by > > unprivileged local users except by going through the version > > control system. Presently I could take 'vi' to one of the ,v > > files in /cvs/gcc and break it thoroughly, or sneak something into > > the file content, and leave no trace. > > There is no interaction with root, so if you own the archive, you can > always do what you want. To get anything approaching this, you have > to deal with PGP signatures, SHA hashes, and the like. OpenCM is > probably the only group (including BitKeeper) that even comes close to > doing this right. This sort of thing has been done simply by a modified setuid (to a cvs user, not root) cvs binary so users can't access the repository directly, only through that binary. More generically, with a reasonable protocol for local repository access it should be possible to use GNU userv to separate the repository from the users. > > 2b. 
Ability to back out an entire change set just as atomically as it > > went in. > > In theory, easy to do (just a few rm's and an mv). There are larger > policy questions, though (Do we want to allow that?). Some day, I may > just hack something together that does that. A change set is applied. It turns out to have problems, so needs to be reverted - common enough. Of course the version history and ChangeLog shows both the original application and reversion. The reversion might in fact be of the original change set and a series of subsequent failed attempts at patching it up. But intermediate unrelated changes to the tree should not be backed out in the process. > > 3. Should allow a user without commit privileges to generate a change > > set, making arbitrary changes to the repository (none of this "you > > can edit files and generate diffs but you can't add or delete > > files" nonsense), which can be applied by a user who does have > > commit privileges, and when the original author does an update > > he/she doesn't get spurious conflicts. > > Are you thinking of sending patches by email? Arch doesn't have that. Patches by email (with distributed patch review by multiple people reading gcc-patches, including those who can't actually approve the patch) is the normal way GCC development works. Presume that most contributors will not want to deal with security issues of making any local repository accessible to other machines, even if it's on a permanently connected machine and local firewalls or policy don't prevent this. A patch for use with a better version control system would need to include some encoding for that system of renames / deletes / ... - but that needs to be just as human-readable as context diffs / unidiffs are. -- Joseph S. Myers jsm28@cam.ac.uk ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: source mgt. requirements solicitation 2002-12-09 15:27 ` Joseph S. Myers @ 2002-12-09 17:05 ` Walter Landry 2002-12-09 17:10 ` Joseph S. Myers 2002-12-09 17:50 ` Zack Weinberg 1 sibling, 2 replies; 60+ messages in thread From: Walter Landry @ 2002-12-09 17:05 UTC (permalink / raw) To: gcc "Joseph S. Myers" <jsm28@cam.ac.uk> wrote: > On Mon, 9 Dec 2002, Walter Landry wrote: > > > arch doesn't interact at all with root. The remote repositories are > > all done with sftp, ftp, and http, which is as secure as those servers > > are. > > Is this - for anonymous access - _plain_ HTTP, or HTTP + WebDAV + DeltaV > which svn uses? One problem there was with SVN - it may have been fixed > by now, and a fix would be necessary for it to be usable for GCC - was its > use of HTTP and HTTPS (for write access); these tend to be heavily > controlled by firewalls and the ability to tunnel over SSH (with just that > one port needing to be open) would be necessary. "Transparent" proxies > may pass plain HTTP OK, but not the WebDAV/DeltaV extensions SVN needs. Anonymous access requires HTTP + WebDAV (no DeltaV). However, the set of WebDAV commands needed is much smaller than what subversion needs. It just needs whatever anonymous ftp has that http doesn't (I believe PROPFIND is one). In particular, you can run a server using apache 1.3. > > > 0d. The data stored in the repository cannot be modified by > > > unprivileged local users except by going through the version > > > control system. Presently I could take 'vi' to one of the ,v > > > files in /cvs/gcc and break it thoroughly, or sneak something into > > > the file content, and leave no trace. > > > > There is no interaction with root, so if you own the archive, you can > > always do what you want. To get anything approaching this, you have > > to deal with PGP signatures, SHA hashes, and the like.
OpenCM is > > probably the only group (including BitKeeper) that even comes close to > > doing this right. > > This sort of thing has been done simply by a modified setuid (to a cvs > user, not root) cvs binary so users can't access the repository directly, > only through that binary. More generically, with a reasonable protocol > for local repository access it should be possible to use GNU userv to > separate the repository from the users. This is a different security model. Arch is secure because it doesn't depend on having privileged access. For example, there is an "rm -rf" command built into arch. I have a feeling that you are thinking of how CVS handles things, with a centralized server. Part of the whole point of arch is that there is no centralized server. So, for example, I can develop arch independently of whether Tom thinks that I am worthy enough to do so. I can screw up my archive as much as I want (and I have), and Tom can be blissfully unaware. Easy merging is what makes this possible. So you don't, in general, have a repository that is writeable by more than one person. > > > 2b. Ability to back out an entire change set just as atomically as it > > > went in. > > > > In theory, easy to do (just a few rm's and an mv). There are larger > > policy questions, though (Do we want to allow that?). Some day, I may > > just hack something together that does that. > > A change set is applied. It turns out to have problems, so needs to be > reverted - common enough. Of course the version history and ChangeLog > shows both the original application and reversion. The reversion might in > fact be of the original change set and a series of subsequent failed > attempts at patching it up. But intermediate unrelated changes to the > tree should not be backed out in the process. To get what you really want means that we can reverse our patches. Then you could simply unapply a patch. But that isn't possible right now, and is not going to be done real soon.
That would require someone who is actually working on the code to understand the current patch format. Regards, Walter Landry wlandry@ucsd.edu ^ permalink raw reply [flat|nested] 60+ messages in thread
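Reversing a patch, the ability Walter says arch lacks, is conceptually simple: if each hunk records both the old and the new text, the inverse patch just swaps them, and applying it backs the change out while leaving later, unrelated edits alone (Joseph's requirement). A toy line-level sketch (illustrative only, not arch's changeset format):

```python
def invert(patch):
    """Invert a patch given as a list of (line_no, old, new) hunks.
    Applying the inverse is a 'subtractive merge': it removes
    exactly that change and nothing else."""
    return [(n, new, old) for (n, old, new) in patch]

def apply_patch(lines, patch):
    """Apply hunks to a list of lines, refusing on any mismatch."""
    out = list(lines)
    for n, old, new in patch:
        if out[n] != old:
            raise ValueError("conflict at line %d" % n)
        out[n] = new
    return out

base = ["alpha", "beta", "gamma"]
p = [(1, "beta", "BETA")]
patched = apply_patch(base, p)
# An unrelated later change touches a different line...
patched[2] = "gamma2"
# ...and reverting p leaves that change intact:
reverted = apply_patch(patched, invert(p))
print(reverted)   # ['alpha', 'beta', 'gamma2']
```

The hard part in a real system is that deltas rarely store both sides so symmetrically, and renames and deletions need inverses too, which is presumably why understanding the current patch format is the prerequisite Walter mentions.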
* Re: source mgt. requirements solicitation 2002-12-09 17:05 ` Walter Landry @ 2002-12-09 17:10 ` Joseph S. Myers 2002-12-09 18:27 ` Walter Landry 2002-12-09 17:50 ` Zack Weinberg 1 sibling, 1 reply; 60+ messages in thread From: Joseph S. Myers @ 2002-12-09 17:10 UTC (permalink / raw) To: Walter Landry; +Cc: gcc On Mon, 9 Dec 2002, Walter Landry wrote: > Anonymous access requires for HTTP + WebDAV (no DeltaV). However, the > set of WebDAV commands needed are much smaller than what subversion > needs. It just needs whatever anonymous ftp has that http doesn't (I > believe PROPFIND is one). In particular, you can run a server using > apache 1.3. I'm sure some "transparent" proxies will fail to pass even that (though WebDAV may be better supported by them than DeltaV). This is similar to Zack's first point - just as any new system must be no less portable to running on different systems, it must be no less portable to working through networks restricted in different ways. > I have a feeling that you are thinking of how CVS handles things, with > a centralized server. Part of the whole point of arch is that there > is no centralized server. So, for example, I can develop arch > independently of whether Tom thinks that I am worthy enough to do so. > I can screw up my archive as much as I want (and I have), and Tom can > be blissfully unaware. Easy merging is what makes this possible. > > So you don't, in general, have a repository that is writeable by more > than one person. For GCC there clearly needs to be some server that has the mainline of development we advertise on our web pages for users, from which release branches are made, which has some vague notions of the machine being securely maintained, having adequate bandwidth, having some backup procedure, having maintainers for the server keeping it up reliably, having a reasonable expectation that the development lines in there will still be available in 20 years' time when current developers have lost interest. 
(gcc.gnu.org presents a remarkably good impression of this to the outside world, considering how it operates purely by volunteer effort.) There may be many other servers - private and public - but some server provides the line of development that gets branched into new releases, and inevitably multiple people may write to that line. (I'm also presuming - see <http://gcc.gnu.org/ml/gcc/2002-12/msg00436.html> - that all the developments in any third party repository that get discussed on the lists should be mirrored into this main one to give some hope of long term survival and availability. In developing GCC with list archives and version control we are simultaneously acting as curators of the history of GCC development, which means attempting to preserve that history for posterity (a period beyond the involvement of any one individual).) -- Joseph S. Myers jsm28@cam.ac.uk ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: source mgt. requirements solicitation 2002-12-09 17:10 ` Joseph S. Myers @ 2002-12-09 18:27 ` Walter Landry 2002-12-09 19:16 ` Joseph S. Myers 0 siblings, 1 reply; 60+ messages in thread From: Walter Landry @ 2002-12-09 18:27 UTC (permalink / raw) To: gcc "Joseph S. Myers" <jsm28@cam.ac.uk> wrote: > On Mon, 9 Dec 2002, Walter Landry wrote: > > > Anonymous access requires for HTTP + WebDAV (no DeltaV). However, the > > set of WebDAV commands needed are much smaller than what subversion > > needs. It just needs whatever anonymous ftp has that http doesn't (I > > believe PROPFIND is one). In particular, you can run a server using > > apache 1.3. > > I'm sure some "transparent" proxies will fail to pass even that (though > WebDAV may be better supported by them than DeltaV). This is similar to > Zack's first point - just as any new system must be no less portable to > running on different systems, it must be no less portable to working > through networks restricted in different ways. Well, there is anonymous ftp. But if all you have is plain http, I would think that you would have problems checking out from CVS as well. > > So you don't, in general, have a repository that is writeable by more > > than one person. > > For GCC there clearly needs to be some server that has the mainline of > development we advertise on our web pages for users, from which release > branches are made, which has some vague notions of the machine being > securely maintained, having adequate bandwidth, having some backup > procedure, having maintainers for the server keeping it up reliably, > having a reasonable expectation that the development lines in there will > still be available in 20 years' time when current developers have lost > interest. (gcc.gnu.org presents a remarkably good impression of this to > the outside world, considering how it operates purely by volunteer > effort.) That, presumably, would be the release manager's branch. 
Periodically, people would say, "feature X is implemented on branch Y". If the release manager trusts them, then he does a simple update. If there is no trust, then the release manager can review the patches. In any case, assuming the submitter knows what they are doing, the patch will apply cleanly. It would be very quick. If it doesn't apply cleanly, then the release manager sends a curt note to the submitter (perhaps automatically) or tries to resolve it himself. This is how the Linux kernel development works, although a release manager wouldn't have to do as much work as Linus does. Regards, Walter Landry wlandry@ucsd.edu ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: source mgt. requirements solicitation 2002-12-09 18:27 ` Walter Landry @ 2002-12-09 19:16 ` Joseph S. Myers 2002-12-10 0:27 ` Zack Weinberg 0 siblings, 1 reply; 60+ messages in thread From: Joseph S. Myers @ 2002-12-09 19:16 UTC (permalink / raw) To: Walter Landry; +Cc: gcc On Mon, 9 Dec 2002, Walter Landry wrote: > Well, there is anonymous ftp. But if all you have is plain http, I > would think that you would have problems checking out from CVS as > well. FTP is probably useful in most such cases (for arch, since I didn't think svn provided FTP transport; and I don't know about the other systems mentioned). The case I was thinking of is the common situation where most outgoing ports are free but port 80 is redirected through a "transparent" proxy to save ISP bandwidth. (In such situations, a few other outgoing ports such as 25 are probably proxied but are irrelevant here.) That's one common (consumer) situation, and FTP and HTTPS probably work there, but another (corporate) situation may well have HTTPS tied down more tightly. Where people have got pserver or ssh allowed through their firewall, there may be more problems with protocols used for other purposes and restricted or proxied for other reasons. (Some blocks might be avoided by choosing nonstandard ports, but then everyone is likely to choose different ports and create more confusion. Arranging that both write and anonymous access is tunnelled over ssh - as some sites do for anonymous CVS access - simplifies things.) > That, presumably, would be the release manager's branch. > Periodically, people would say, "feature X is implemented on branch > Y". If the release manager trusts them, then he does a simple update. > If there is no trust, then the release manager can review the patches. > In any case, assuming the submitter knows what they are doing, the > patch will apply cleanly. It would be very quick.
If it doesn't > apply cleanly, then the release manager sends a curt note to the > submitter (perhaps automatically) or tries to resolve it himself. > This is how the Linux kernel development works, although a release > manager wouldn't have to do as much work as Linus does. There are about 100 people applying patches to the mainline (half maintainers of some of the code who can apply some patches without review, half needing review for all nonobvious patches). Having the release manager manually handle the patches from all 100 people is not a sensible scalable solution for GCC; the expectation is that anyone producing a reasonable number of good patches will get write access which reduces the reviewers' effort (to needing only to review the patch, not apply it) and means that the version control logs clearly show which user was responsible for a patch by who checked it in (the case of someone else, named in the log message, being responsible, being the exceptional case). Note that the 50 or so maintainers all do some patch review; it's only at a late stage on the actual release branches that the review is concentrated in the release manager. (You might then say that the release manager could have a bot automatically applying patches from developers who now have write access, but this has no real advantages over them all having write access and a lot of fragility added in.) The Linux model of one person controlling everything going into the mainline is exceptional; GCC, *BSD, etc., have many committers to mainline (the rules for who commits where with what review varying) and as Zack explains <http://gcc.gnu.org/ml/gcc/2002-12/msg00492.html> (albeit missing the footnote [1] on where releases are made from) this mainline on a master server will remain central, with new developments normally going there rapidly except for various major work on longer-term surrounding branches. 
Zack notes that practically the main use of a distributed system would be for individual developers to do their work offline, not to move the main repository for communication between developers off a single machine (though depending on the system, developers may naturally have repository mirrors) - it is not in general the case that development takes place on always-online systems or systems which can allow remote access to their repositories. I expect for most branches it will also be most convenient for the master server to host them. The exceptions are most likely to be for developments that aren't considered politically appropriate for mainstream GCC, or those that aren't assigned to the FSF or may have other legal problems, or those done under NDA (albeit legally dodgy), or corporate developments whose public visibility too early would give away sensitive information or which a customer would like to have before they go public (i.e., work eventually destined for public GCC unless too specialised or ugly, where the customer and company would be free under the GPL to release the work early but choose not to). In general, I expect most development would be on the central server, except for small-scale individual development (often offline) on personal servers and corporate development on internal systems that definitely will not be accessible to the public. This is, of course, all just hypothesis about how GCC development would work with distributed CM, but it seems a reasonable extrapolation supposing we start from wanting to preserve security, accessibility and long-term survival of all the version history of developments that presently go in the public repository (mainline and branches). -- Joseph S. Myers jsm28@cam.ac.uk ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: source mgt. requirements solicitation 2002-12-09 19:16 ` Joseph S. Myers @ 2002-12-10 0:27 ` Zack Weinberg 2002-12-10 0:41 ` Tom Lord ` (3 more replies) 0 siblings, 4 replies; 60+ messages in thread From: Zack Weinberg @ 2002-12-10 0:27 UTC (permalink / raw) To: Joseph S. Myers; +Cc: Walter Landry, gcc "Joseph S. Myers" <jsm28@cam.ac.uk> writes: > The Linux model of one person controlling everything going into the > mainline is exceptional; GCC, *BSD, etc., have many committers to > mainline (the rules for who commits where with what review varying) > and as Zack explains <http://gcc.gnu.org/ml/gcc/2002-12/msg00492.html> > (albeit missing the footnote [1] on where releases are made from) > this mainline on a master server will remain central, with new > developments normally going there rapidly except for various major > work on longer-term surrounding branches. The missing footnote was going to be an argument that the Linux model is not just exceptional, but pathological. Not something I think we should emulate with GCC, and not something I consider worth designing a version control system to support. Linux is a large project - 4.3 million lines of code - but only one person has commit privileges on the official tree, for any given release branch. No matter how good their tools are, this cannot be expected to scale, and indeed it does not. I have not actually measured it, but the appearance of the traffic on linux-kernel is that Linus drops patches on the floor just as often as he did before he started using Bitkeeper. However, Bitkeeper facilitates other people maintaining their own semi-official versions of the tree, in which some of these patches get sucked up. That is bad. 
It means users have to choose between N different variants; as time goes by it becomes increasingly difficult to put them all back together again; eventually will come a point where critical feature A is available only in tree A, critical feature B is available only in tree B, and the implementations conflict, because no one's exerting adequate centripetal force. Possibly I am too pessimistic. zw ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: source mgt. requirements solicitation 2002-12-10 0:27 ` Zack Weinberg @ 2002-12-10 0:41 ` Tom Lord 2002-12-10 12:05 ` Phil Edwards ` (2 subsequent siblings) 3 siblings, 0 replies; 60+ messages in thread From: Tom Lord @ 2002-12-10 0:41 UTC (permalink / raw) To: zack; +Cc: gcc Zack: Linux is a large project - 4.3 million lines of code - but only one person has commit privileges on the official tree, for any given release branch. No matter how good their tools are, this cannot be expected to scale, and indeed it does not. I hope you'll have a look at the process automation scenario in my reply to Joseph S. Myers ("new patch of replies (B)"). -t ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: source mgt. requirements solicitation 2002-12-10 0:27 ` Zack Weinberg 2002-12-10 0:41 ` Tom Lord @ 2002-12-10 12:05 ` Phil Edwards 2002-12-10 19:44 ` Mark Mielke 2002-12-14 13:43 ` Linus Torvalds 3 siblings, 0 replies; 60+ messages in thread From: Phil Edwards @ 2002-12-10 12:05 UTC (permalink / raw) To: Zack Weinberg; +Cc: Joseph S. Myers, Walter Landry, gcc On Mon, Dec 09, 2002 at 11:26:01PM -0800, Zack Weinberg wrote: > the implementations conflict, because no one's exerting adequate > centripetal force. Heh. I never thought I'd hear that term applied to software development. I like it. Phil -- I would therefore like to posit that computing's central challenge, viz. "How not to make a mess of it," has /not/ been met. - Edsger Dijkstra, 1930-2002 ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: source mgt. requirements solicitation 2002-12-10 0:27 ` Zack Weinberg 2002-12-10 0:41 ` Tom Lord 2002-12-10 12:05 ` Phil Edwards @ 2002-12-10 19:44 ` Mark Mielke 2002-12-10 19:57 ` David S. Miller 2002-12-14 13:43 ` Linus Torvalds 3 siblings, 1 reply; 60+ messages in thread From: Mark Mielke @ 2002-12-10 19:44 UTC (permalink / raw) To: Zack Weinberg; +Cc: Joseph S. Myers, Walter Landry, gcc On Mon, Dec 09, 2002 at 11:26:01PM -0800, Zack Weinberg wrote: > Linux is a large project - 4.3 million lines of code - but only one > person has commit privileges on the official tree, for any given > release branch. No matter how good their tools are, this cannot be > expected to scale, and indeed it does not. I have not actually > measured it, but the appearance of the traffic on linux-kernel is that > Linus drops patches on the floor just as often as he did before he > started using Bitkeeper. However, Bitkeeper facilitates other people > maintaining their own semi-official versions of the tree, in which > some of these patches get sucked up. That is bad. It means users > have to choose between N different variants; as time goes by it > becomes increasingly difficult to put them all back together again; > eventually will come a point where critical feature A is available > only in tree A, critical feature B is available only in tree B, and > the implementations conflict, because no one's exerting adequate > centripetal force. > Possibly I am too pessimistic. Actually, the model used for Linux provides substantial freedom. Since no single site is the 'central' site, development can be fully distributed. Changes can be merged back and forth on demand, and remote users require no resources to run, other than the resources to periodically synchronize the data. Unfortunately -- this freedom (as always) comes with a price. The price is that the fully distributed model means that there is no enforced regulation. 
There is no control, and the same freedom that allows anybody to create a variant, allows them to keep a variant. The models are substantially different; however, I would suggest that neither is wrong in the generic sense. The only questions that really matter are: 1) are you more comfortable in a regulated environment, and if so, then 2) are you willing to live with the limitations that a regulated environment gives you? Some of these limitations include the need to maintain contact with a central repository of some sort, and the need for processing at a central repository of some sort. Personally, I'm with you in that I prefer regulation and enforcement. It keeps me from fsck'ing up my own data. mark -- mark@mielke.cc/markm@ncf.ca/markm@nortelnetworks.com __________________________ . . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder |\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ | | | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada One ring to rule them all, one ring to find them, one ring to bring them all and in the darkness bind them... http://mark.mielke.cc/ ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: source mgt. requirements solicitation 2002-12-10 19:44 ` Mark Mielke @ 2002-12-10 19:57 ` David S. Miller 2002-12-10 20:02 ` Phil Edwards 0 siblings, 1 reply; 60+ messages in thread From: David S. Miller @ 2002-12-10 19:57 UTC (permalink / raw) To: mark; +Cc: zack, jsm28, wlandry, gcc From: Mark Mielke <mark@mark.mielke.cc> Date: Tue, 10 Dec 2002 22:42:24 -0500 On Mon, Dec 09, 2002 at 11:26:01PM -0800, Zack Weinberg wrote: > Linux is a large project - 4.3 million lines of code - but only one > person has commit privileges on the official tree, for any given > release branch. No matter how good their tools are, this cannot be > expected to scale, and indeed it does not. I have not actually > measured it, but the appearance of the traffic on linux-kernel is that > Linus drops patches on the floor just as often as he did before he > started using Bitkeeper. However, Bitkeeper facilitates other people > maintaining their own semi-official versions of the tree, in which > some of these patches get sucked up. That is bad. It means users > have to choose between N different variants; as time goes by it > becomes increasingly difficult to put them all back together again; > eventually will come a point where critical feature A is available > only in tree A, critical feature B is available only in tree B, and > the implementations conflict, because no one's exerting adequate > centripetal force. > Possibly I am too pessimistic. Actually, the model used for Linux provides substantial freedom. Since no single site is the 'central' site, development can be fully distributed. Changes can be merged back and forth on demand, and remote users require no resources to run, other than the resources to periodically synchronize the data. I think some assessments are wrong here. Linus does get more patches applied these days, and less gets dropped on the floor. 
Near the end of November, as we were approaching the feature freeze deadline, he was merging on the order of 4MB of code per day if not more. What really ends up happening also is that Linus begins to trust people with entire subsystems. So when Linus pulls changes from their BK tree, he can see if they touch any files outside of their areas of responsibility. Linus used to drop my work often, and I would just retransmit until he took it. Now with BitKeeper, I honestly can't remember the last time he silently dropped a code push I sent to him. The big win with BitKeeper is the whole disconnected operation bit. When the net goes down, I can't check RCS history and make diffs against older versions of files in the gcc tree. With Bitkeeper I have all the revision history in my cloned tree so there is zero need for me to ever go out onto the network to do work until I want to share my changes with other people. This also decreases the load on the machine with the "master" repository. There is nothing about this which makes it incompatible with how GCC works today. So if arch and/or Subversion can support the kind of model BitKeeper can, we'd set it up like so: 1) gcc.gnu.org would still hold the "master" repository 2) there would be trusted people with write permission who could thusly push their changes into the master tree Releases and tagging would still be done by someone like Mark except it hopefully wouldn't take several hours to do it :-) ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: source mgt. requirements solicitation 2002-12-10 19:57 ` David S. Miller @ 2002-12-10 20:02 ` Phil Edwards 2002-12-10 23:07 ` David S. Miller 0 siblings, 1 reply; 60+ messages in thread From: Phil Edwards @ 2002-12-10 20:02 UTC (permalink / raw) To: David S. Miller; +Cc: mark, zack, jsm28, wlandry, gcc On Tue, Dec 10, 2002 at 07:41:05PM -0800, David S. Miller wrote: > When the net goes down, I can't check RCS history and make diffs > against older versions of files in the gcc tree. I just rsync the repository and do everything but checkins locally. Very very fast. > With Bitkeeper I have all the revision history in my cloned tree so > there is zero need for me to every go out onto the network to do work > until I want to share my changes with other people. This also > decreases the load on the machine with the "master" repository. So does the rsync-repo technique. Phil -- I would therefore like to posit that computing's central challenge, viz. "How not to make a mess of it," has /not/ been met. - Edsger Dijkstra, 1930-2002 ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: source mgt. requirements solicitation 2002-12-10 20:02 ` Phil Edwards @ 2002-12-10 23:07 ` David S. Miller 2002-12-11 6:31 ` Phil Edwards 0 siblings, 1 reply; 60+ messages in thread From: David S. Miller @ 2002-12-10 23:07 UTC (permalink / raw) To: phil; +Cc: mark, zack, jsm28, wlandry, gcc From: Phil Edwards <phil@jaj.com> Date: Tue, 10 Dec 2002 22:57:24 -0500 > With Bitkeeper I have all the revision history in my cloned tree so > there is zero need for me to ever go out onto the network to do work > until I want to share my changes with other people. This also > decreases the load on the machine with the "master" repository. So does the rsync-repo technique. That's not distributed source management, that's "I copy the entire master tree onto my computer." If you make modifications to your local rsync'd master tree, you can't transparently push those changes to other people unless you set up anoncvs on your computer and tell them "use this as your master repo instead of gcc.gnu.org to get my changes". That's bolted onto the side, not part of the design. ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: source mgt. requirements solicitation 2002-12-10 23:07 ` David S. Miller @ 2002-12-11 6:31 ` Phil Edwards 0 siblings, 0 replies; 60+ messages in thread From: Phil Edwards @ 2002-12-11 6:31 UTC (permalink / raw) To: David S. Miller; +Cc: mark, zack, jsm28, wlandry, gcc On Tue, Dec 10, 2002 at 07:58:36PM -0800, David S. Miller wrote: > From: Phil Edwards <phil@jaj.com> > Date: Tue, 10 Dec 2002 22:57:24 -0500 > > > With Bitkeeper I have all the revision history in my cloned tree so > > there is zero need for me to every go out onto the network to do work > > until I want to share my changes with other people. This also > > decreases the load on the machine with the "master" repository. > > So does the rsync-repo technique. > > That's not distributed source management, that's "I copy the entire > master tree onto my computer." I'm not claiming otherwise. I'm simply offering a tip to make life easier for current users in the current situation. What I said is still true with regards to the paragraph I quoted. Phil -- I would therefore like to posit that computing's central challenge, viz. "How not to make a mess of it," has /not/ been met. - Edsger Dijkstra, 1930-2002 ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: source mgt. requirements solicitation 2002-12-10 0:27 ` Zack Weinberg ` (2 preceding siblings ...) 2002-12-10 19:44 ` Mark Mielke @ 2002-12-14 13:43 ` Linus Torvalds 2002-12-14 14:06 ` Tom Lord ` (2 more replies) 3 siblings, 3 replies; 60+ messages in thread From: Linus Torvalds @ 2002-12-14 13:43 UTC (permalink / raw) To: zack, gcc [ See the blurb about OpenCM at the end. ] In article <87isy2mj1y.fsf@egil.codesourcery.com>, Zack Weinberg <zack@codesourcery.com> wrote: > >Linux is a large project - 4.3 million lines of code - but only one >person has commit privileges on the official tree, for any given >release branch. No. That's not how it works. Linux, unlike _every_ other project I know of, has always actively encouraged "personal/vendor branches", and that is in fact how 99% of all development has happened. Most development happens in trees that have _nothing_ to do with the official tree. To me, the whole CVS model (many branches in one centralized repository) is just incredibly broken, and you should realize that that isn't how Linux has ever worked. My tree is often called the "official" tree, but what it really is is just a base tree that many people maintain their own forks from. This is fundamentally _more_ scalable than the CVS mess that is gcc, since it much more easily allows for very radical branches that do not need any centralized permission from me. Think of it this way: in gcc, the egcs split was a very painful thing. In Linux, those kinds of splits (people doing what they think is right, _without_ support from the official maintainers) are how _everything_ gets done. Linux is a "constantly forking" project, and that's how development very fundamentally happens. And a fork is a lot more scalable than a branch. It's also a lot more powerful: it gives _full_ rights to the forker. That implies that a forked source tree should be a first-class citizen, not just something that was copied off somebody else's CVS tree. 
The BitKeeper "clone" thing is a beautiful implementation of the Linux development model. > No matter how good their tools are, this cannot be >expected to scale, and indeed it does not. Sorry, but you're wrong. Probably simply because you're too used to the broken CVS model. I would like to point out that Linux development has scaled a lot better than gcc, to a larger source base (it's 5+ M lines) with much more fundamental programming issues (concurrency etc). I will bet you that the Linux kernel merges are a lot bigger than the gcc ones, that development happens faster, and that there are more independent developers working on their own versions of Linux than there are of gcc. There aren't just a handful of branches, there are _hundreds_. Many of them end up not being interesting, or ever necessarily merged back. And _none_ of them required write access to my tree. I'd also like to point out that Linux has _never_ had a flap like the gcc/egcs/emacs/xemacs splits. Exactly because of the _much_ more scalable approach of just fundamentally always having had a distributed development model that allows _anybody_ to contribute easily, instead of having a model that makes certain people have "special powers". In short, _my_ tree is _not_ the same thing as the gcc CVS sources. > I have not actually >measured it, but the appearance of the traffic on linux-kernel is that >Linus drops patches on the floor just as often as he did before he >started using Bitkeeper. Measure the number of changes accepted, and I bet the Linux kernel approach had an order of magnitude more changes than gcc has _ever_ had. Even before using Bitkeeper. The proof is in the pudding - care to compare real numbers, and compare sizes of weekly merged patches? I bet gcc will come in _far_ behind. > However, Bitkeeper facilitates other people >maintaining their own semi-official versions of the tree, in which >some of these patches get sucked up. That is bad. No. Have you ever used Bitkeeper? 
Really _used_ it? I've used both bitkeeper and CVS (I refuse to touch CVS with a ten-foot pole for my "fun" projects, but I've used CVS for big projects at work), and I can tell you, CVS doesn't even come _close_. Not even with various wrapper helper tools to make things like CVS branches look even remotely palatable. The part that you're missing, simply because you've probably used CVS for too long, is the _distributed_ nature of Bitkeeper, and of Linux development. Repeat after me: "There is no single tree". Everything is distributed. Any source control system that has "write access" issues is fundamentally broken in my opinion. Your repository should be _yours_, and nobody else's. There is no "write access". There is only some way to expedite a merge between two repositories. The source control management should make it easy for you to export your changes to other repositories. In fact, it should make it easy for you to have many different repositories - for different things you're working on. Bitkeeper does this very well. It's _the_ reason I use bitkeeper. BK does other things too, but they all pale to the _fundamental_ idea of each repository being a thing unto itself, and having no stupid "branches", but simply having truly distributed repositories. Some people think that is an "offline" feature, but nothing could be further from the truth. The _real_ issue about independent repositories is that it makes it possible to do truly independent development, and makes notions like branches such an outdated idea. Projects like Subversion never seem to have really _understood_ the notion of true distributed repositories. And by not understanding them, like you they miss the whole point of truly scalable development. Development that scales _past_ the notion of one central repository. >Possibly I am too pessimistic. No. You're not pessimistic, you just don't _understand_. You don't have to believe me. Believe the numbers. Look at which project gets more done. 
And realize that even before Linux used Bitkeeper, it used the truly distributed _model_. The model is independent from what SCM you use, although some SCMs obviously cannot support some models (and CVS in particular forces its users to use a particularly broken model). Btw, I realize that there's no way in hell gcc will use bitkeeper. I'm not trying to push that. I'm just hoping that if gcc does change to something smarter than CVS, it would change to something that is truly distributed, and doesn't have that broken "branch" notion, or the notion of needing write permissions to some stupid central repository in order to enjoy the idea of SCM. Looking at the current projects out there, the only one that looks like it has more than half a clue is "OpenCM". It doesn't seem to really do the distributed thing right, but at least from what I've seen it looks like they have the right notions about doing it. The OpenCM project seems to still believe that distribution is just about "disconnected commits" rather than understanding that if you do distributed repositories right you shouldn't have branches at all (instead of a branch, you should just have a _different_ repository), but they at least seem to understand the importance of true distribution. I hope gcc developers are giving it a look. Linus ^ permalink raw reply [flat|nested] 60+ messages in thread
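[Editorial note: the "there is no single tree" model Linus describes can be caricatured in a few lines. The sketch below is a toy abstraction, not BitKeeper's actual data model: each repository is just a named bag of changesets with no special status, and a "pull" is a read-only merge of someone else's changesets into your own. The repository names and changeset labels are invented for illustration.]

```python
class Repo:
    """Toy model of a distributed repository: a named set of changesets."""

    def __init__(self, name):
        self.name = name
        self.changesets = set()

    def commit(self, changeset):
        # A local commit needs nobody's permission.
        self.changesets.add(changeset)

    def pull(self, other):
        # "Merging" is pulling: it reads the other repository and never
        # mutates it, so there is no notion of remote write access.
        self.changesets |= other.changesets


official = Repo("linus")
official.commit("base")

davem = Repo("davem")
davem.pull(official)        # clone: start from someone else's tree
davem.commit("net-fix")     # independent development, offline if need be

official.pull(davem)        # the merge happens when the *puller* decides
print(sorted(official.changesets))  # → ['base', 'net-fix']
```

The point of the sketch: no repository is structurally "the" tree — the one called official is only official by social convention.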
* Re: source mgt. requirements solicitation 2002-12-14 13:43 ` Linus Torvalds @ 2002-12-14 14:06 ` Tom Lord 2002-12-14 17:44 ` Linus Torvalds 2002-12-14 14:41 ` Neil Booth 2002-12-14 15:33 ` Momchil Velikov 2 siblings, 1 reply; 60+ messages in thread From: Tom Lord @ 2002-12-14 14:06 UTC (permalink / raw) To: gcc; +Cc: torvalds The OpenCM project seems to still believe that distribution is just about "disconnected commits" rather than understanding that if you do distributed repositories right you shouldn't have branches at all (instead of a branch, you should just have a _different_ repository), Branches can span repository boundaries just fine, and that's a nice way to keep useful track of the history that relates the two forks. Distribution is orthogonal to branching. Two repositories can be separately administered and fully severable, yet branches can usefully exist between them. For the "branched from" repository, this is a passive, read-only operation. Looking at the current projects out there, the only one that looks like it has more than half a clue is "OpenCM". It doesn't seem to really do the distributed thing right, but at least from what I've seen it looks like they have the right notions about doing it. What aspect of arch has you confused? or, alternatively, what flaw do you see in arch's approach to distribution? The _real_ issue about independent repositories is that it makes it possible to do truly independent development, and makes notions like branches such an outdated idea. Arranging that one line is a branch of another (even when they are in two independent, severable repositories) facilitates simpler and more accurate queries about how their histories relate. Among other things, such queries can (a) take more of the drudgery out of some common merging tasks, (b) better facilitate process automation when the forks are, in fact, contributing to a common line of development. 
GCC development faces a problem which Linux kernel development, you seem to have said elsewhere, avoids by social means: it has direct and appreciated contributors to the mainline who, nevertheless, are asked to contribute their changes indirectly through a formal review and testing process (rather than through, say, a "trusted lieutenant" -- in other words, in GCC, the work pool of the "lieutenants" ("maintainers", actually) is collected and shared among them in flexible, fine-grained ways that are performed with considerable discipline). Distribution _with_ branches can be a boon to those contributors and the maintainers. Overall -- I don't think there can be _that_ much contrast between the GCC and LK development processes. GCC is a bit like LK, except that instead of a Linus, GCC has a team. That team needs (and has) tools to make them effective as the "virtual linus". (Some of us have ideas for even better tools :-) That there is less of a tendency for 3rd parties to throw up their arms and make their own forks may not have quite the implications you assert: the natures of the two systems and the uses they are put to make comparison very difficult. -t ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: source mgt. requirements solicitation 2002-12-14 14:06 ` Tom Lord @ 2002-12-14 17:44 ` Linus Torvalds 2002-12-14 19:45 ` Tom Lord 0 siblings, 1 reply; 60+ messages in thread From: Linus Torvalds @ 2002-12-14 17:44 UTC (permalink / raw) To: Tom Lord; +Cc: gcc On Sat, 14 Dec 2002, Tom Lord wrote: > > What aspect of arch has you confused? or, alternatively, what flaw > do you see in arch's approach to distribution? To be honest, I tried arch back when I was testing different SCM's for the kernel, and even just the setup confused me enough that I never got past that phase. I suspect I just tried it too early in the development cycle, and that turned me off it. Also, the oft-repeated performance issues have kept me wary about arch. Bitkeeper is quite fast, but even so Larry and friends actually ended up having to make some major performance improvements to the bk-3 release simply because they were taken by surprise at just _how_ much data the kernel SCM ends up needing to process. I realize that there are a lot of advantages to keeping to high-level scripting languages for the SCM, but it's also quite important to try to avoid making the SCM itself be a distraction from a performance standpoint. However, since I never got very far with arch, I really only parrot what I've heard from others about its performance, so this may be unfair. Linus ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: source mgt. requirements solicitation 2002-12-14 17:44 ` Linus Torvalds @ 2002-12-14 19:45 ` Tom Lord 0 siblings, 0 replies; 60+ messages in thread From: Tom Lord @ 2002-12-14 19:45 UTC (permalink / raw) To: torvalds; +Cc: gcc > Also, the oft-repeated performance issues have kept me wary > about arch. Fair if you're evaluating it from the "should I start using this tomorrow" perspective (don't). I think most of us who are fairly deep into arch think these problems have straightforward solutions, and my goal is to try to find a solution to the resource crisis that keeps me from finishing the work. I realize that there are a lot of advantages to keeping to high-level scripting languages for the SCM, but it's also quite important to try to avoid making the SCM itself be a distraction from a performance standpoint. However, since I never got very far with arch, I really only parrot what I've heard from others about its performance, so this may be unfair. The prototype/reference implementation of arch _is_ a mixture of shell scripts and small C programs. I think the enforced simplicity is very good for the architecture and I'm quite optimistic about the future performance potential of this code. arch is tiny, and I'm encouraging alternative implementations for a variety of purposes. I hear that (have some salt grains with this) someone is working on one in C++, and someone else on one in Python. A Perl translation was made, but work on it seems to have stopped (perhaps because the author changed work contexts) around the time it was starting to function. It is not quite accurate to say "the current implementation is slow because it uses sh" -- some sh parts need recasting in C, many don't, some of the admin tweaks that improve performance need to be made more automatic....things like that. It's an optimizable prototype that has not been prematurely optimized. 
Just reading what you say here: the arch design has everything you like about BK and probably a bit more to boot. It's just a resource problem to get it to a 1.0 that is as comfortable to adopt as you've found BK. Rumours that that will require $12M are exaggerated by, in my estimate, about a factor of 10. To be honest, I tried arch back when I was testing different SCM's for the kernel, and even just the setup confused me enough that I never got past that phase. I suspect I just tried it too early in the development cycle, and that turned me off it. Perhaps. The currently active developers seem to be giving a lot of attention to encapsulating matters such as that in convenience commands layered over the core. -t ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: source mgt. requirements solicitation 2002-12-14 13:43 ` Linus Torvalds 2002-12-14 14:06 ` Tom Lord @ 2002-12-14 14:41 ` Neil Booth 2002-12-14 15:47 ` Zack Weinberg 2002-12-14 15:33 ` Momchil Velikov 2 siblings, 1 reply; 60+ messages in thread From: Neil Booth @ 2002-12-14 14:41 UTC (permalink / raw) To: Linus Torvalds; +Cc: zack, gcc Linus Torvalds wrote:- > The part that you're missing, simply because you've probably used CVS > for too long, is the _distributed_ nature of Bitkeeper, and of Linux > development. Repeat after me: "There is no single tree". Everything is > distributed. Uh, careful, Zack wrote parts of Bitkeeper, including designing the network protocols IIRC. Neil. ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: source mgt. requirements solicitation 2002-12-14 14:41 ` Neil Booth @ 2002-12-14 15:47 ` Zack Weinberg 0 siblings, 0 replies; 60+ messages in thread From: Zack Weinberg @ 2002-12-14 15:47 UTC (permalink / raw) To: Neil Booth; +Cc: Linus Torvalds, gcc Neil Booth <neil@daikokuya.co.uk> writes: > Linus Torvalds wrote:- > >> The part that you're missing, simply because you've probably used CVS >> for too long, is the _distributed_ nature of Bitkeeper, and of Linux >> development. Repeat after me: "There is no single tree". Everything is >> distributed. > > Uh, careful, Zack wrote parts of Bitkeeper, including designing the network > protocols IIRC. It is my understanding that the network protocol I designed is no longer in use, and good riddance, it was my first try at such things and I didn't know what I was doing. But yes, I worked on Bitkeeper for about six months in 2000, so I do know what its architecture is like. zw ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: source mgt. requirements solicitation 2002-12-14 13:43 ` Linus Torvalds 2002-12-14 14:06 ` Tom Lord 2002-12-14 14:41 ` Neil Booth @ 2002-12-14 15:33 ` Momchil Velikov 2002-12-14 16:06 ` Linus Torvalds 2 siblings, 1 reply; 60+ messages in thread From: Momchil Velikov @ 2002-12-14 15:33 UTC (permalink / raw) To: Linus Torvalds; +Cc: zack, gcc >>>>> "Linus" == Linus Torvalds <torvalds@transmeta.com> writes: Linus> My tree is often called the "official" tree, but what it Linus> really is is just a base tree that many people maintain Linus> their own forks from. This is fundamentally _more_ Err, haven't you noticed that this is the tree that many (all) people want to merge their forks into? I think this is "what it really is". When evaluating an SCM tool, IMHO, the most important thing is the ease of merges - remove the need for later merges and any sophisticated "fork" tool boils down to a "cp -R". Linus> scalable than the CVS mess that is gcc, since it much more Linus> easily allows for very radical branches that do not need Linus> any centralized permission from me. I surely have a "fork" of GCC and I ain't got nobody's permission. Permission is needed not when forking, but when merging. Linus> Think of it this way: in gcc, the egcs split was a very Linus> painful thing. In Linux, those kinds of splits (people Linus> doing what they think is right, _without_ support from the Linus> official maintainers) is how _everything_ gets done. Linux Linus> is a "constantly forking" project, and that's how Linus> development very fundamentally happens. Linus> And a fork is a lot more scalable than a branch. It's also There's no difference, unless by "branch" and "fork" you mean the corresponding implementations in CVS and BK of one and the same development model. Linus> I would like to point out that Linux development has scaled Linus> a lot better than gcc, to a larger source base (it's 5+ M Linus> lines) with much more fundamental programming issues Linus> (concurrency etc). 
I will bet you that the Linux kernel Linus> merges are a lot bigger than the gcc ones, that development Linus> happens faster, and that there are more independent Linus> developers working on their own versions of Linux than Linus> there are of gcc. How about a different view on the subject? IMHO a good metric of the complexity of a particular problem/domain is the overall ability of mankind to cope with it. Thus, what you describe may be due to the fact that people capable of kernel programming are far more numerous than people capable of compiler programming, IOW, that most kernel programming requires rather basic programming knowledge, compared to most compilers programming. No? Linus> Any source control system that has "write access" issues is Linus> fundamentally broken in my opinion. Your repository should Linus> be _yours_, and nobody elses. There is no "write access". Linus> There is only some way to expedite a merge between two Linus> repositories. The source control management should make it Linus> easy for you to export your changes to other repositories. An SCM should facilitate collaboration. Any SCM that requires single person's permission for modifications to the source base (e.g. by having only private repositories) is broken beyond repair and scalable exactly like a BitKeeper^WBKL. ~velco ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: source mgt. requirements solicitation 2002-12-14 15:33 ` Momchil Velikov @ 2002-12-14 16:06 ` Linus Torvalds 2002-12-15 3:59 ` Momchil Velikov ` (2 more replies) 0 siblings, 3 replies; 60+ messages in thread From: Linus Torvalds @ 2002-12-14 16:06 UTC (permalink / raw) To: Momchil Velikov; +Cc: zack, gcc On 15 Dec 2002, Momchil Velikov wrote: > > I surely have a "fork" of GCC and I ain't got nobody's permission. > Permission is needed not when forking, but when merging. But the point is, the "CVS mentality" means that a fork is harder to merge than a branch, and you often lose all development history when you merge a fork as a result of this (yeah, you can do a _lot_ of work, and try to also merge the SCM information on a fork merge, but it's almost never done because it is so painful). That's why I think the CVS mentality sucks. You have only branches that are "first-class" citizens, and they need write permission to create and are very expensive to create. Note: I'm not saying they are slow - that's just a particular CVS implementation detail. By "expensive" I mean that they cannot easily be created and thrown away, so with the "CVS mentality" those branches only get created for "big" things. And the "cheap" branches (private check-outs that don't need write permissions and can be done by others) lose all access to real source control except the ability to track the original. Two of the cheap branches cannot track each other in any sane way. And they have no revision history at all even internally. Yet it is the _cheap_ branches that should be the true first-class citizen. Potentially throw-away code that may end up being really really useful, but might just be a crazy pipe-dream. The experimental stuff that would _really_ want to have nice source control. And the "CVS mentality" totally makes that impossible. Subversion seems to be only a "better CVS", and hasn't gotten away from that mentality, which is sad. 
> Linus> And a fork is a lot more scalable than a branch. It's also
>
> There's no difference, unless by "branch" and "fork" you mean the
> corresponding implementations in CVS and BK of one and the same
> development model.

Basically, by "branch" I mean something that fundamentally is part
of the "official site". If a branch has to be part of the official
site, then a branch is BY DEFINITION useless for 99% of developers.
Such branches SHOULD NOT EXIST, since they are fundamentally against
the notion of open development!

A "fork" is something where people can just take the tree and do
their own thing to it. Forking simply doesn't work with the CVS
mentality, yet forking is clearly what true open development
requires.

> IMHO a good metric of the complexity of a particular problem/domain is
> the overall ability of the mankind to cope with it.
>
> Thus, what you describe, may be due to the fact that people capable of
> kernel programming are a lot more than people capable of compiler
> programming, IOW, that most kernel programming requires rather basic
> programming knowledge, compared to most compilers programming.
>
> No ?

No.

Sure, you can want to live in your own world, and try to keep the
riff-raff out. That's the argument I hear from a lot of commercial
developers ("we don't want random hackers playing with our code, we
don't believe they can do as good a job as our highly paid
professionals").

The argument is crap. It was crap for the kernel, it's crap for gcc.
The only reason you think "anybody" can program kernels is the fact
that Linux has _shown_ that anybody can do so. If gcc had less of a
restrictive model for accepting patches, you'd have a lot more
random people who would do them, I bet. But gcc development not only
has the "CVS mentality", it has the "FSF disease" with the paperwork
crap and copyright assignment crap. So you keep outsiders out, and
then you say it's because they couldn't do what you can do anyway.

Crap crap crap arguments.
Trust me, there are more intelligent people out there than you
believe, and they can do a hell of a lot better work than you
currently allow them to do. Often with very little formal schooling.

> A SCM should facilitate collaboration. Any SCM that requires single
> person's permission for modifications to the source base (e.g. by
> having only private repositories) is broken beyond repair and
> scalable exactly like a BitKeeper^WBKL.

But you don't _understand_. BK allows hundreds of people to work on
the same repository, if you want to. You just give them BK accounts
on the machine, the same way you do with CVS.

But that's not the scalable way to do things. The _scalable_ thing
is to let everybody have their own tree, and _not_ have that "one
common point" disease. You have the networking people working on
their networking trees _without_ merging back to me, because they
have their own development branches that simply aren't ready yet,
for example. Having a single point for work like that is WRONG, and
it's definitely _not_ scalable.

		Linus

^ permalink raw reply	[flat|nested] 60+ messages in thread
* Re: source mgt. requirements solicitation 2002-12-14 16:06 ` Linus Torvalds @ 2002-12-15 3:59 ` Momchil Velikov 2002-12-15 8:26 ` Momchil Velikov 2002-12-15 17:09 ` Stan Shebs 2 siblings, 0 replies; 60+ messages in thread From: Momchil Velikov @ 2002-12-15 3:59 UTC (permalink / raw) To: Linus Torvalds; +Cc: zack, gcc

>>>>> "Linus" == Linus Torvalds <torvalds@transmeta.com> writes:

>> A SCM should facilitate collaboration. Any SCM that requires
>> single person's permission for modifications to the source base
>> (e.g. by having only private repositories) is broken beyond
>> repair and scalable exactly like a BitKeeper^WBKL.

Linus> But you don't _understand_. BK allows hundreds of people to
Linus> work on the same repository, if you want to. You just give
Linus> them BK accounts on the machine, the same way you do with
Linus> CVS.

Ah, I _do_ understand that this is possible. I _do_ understand very
well that there's no "Linux Kernel Project", but there is "Linus's
kernel tree". You seem to not understand that there _is_ a "GCC
Project" as well as a "GNU Project".

Linus> But that's not the scalable way to do things. The
Linus> _scalable_ thing is to let everybody have their own tree,

In that case you don't need an SCM at all - you can do pretty well
with a few simple utilities to maintain a number of hardlinked
trees:

  "cp -Rl"                    - branch, tag, fork, whatever
  "share <src> <dst>"         - make identical files hardlinks
  "unshare <path>"            - make <path> a file with one link (recursively)
  "dmerge <old> <new> <mine>" - same as merge(1), but for trees
  "diff -r"                   - make a changeset
  "mail"                      - send a changeset
  "patch"                     - apply a changeset
  "rm -rf"                    - transaction rollback, so we have atomicity, see :)

Linus> and _not_ have that "one common point" disease. You have

This is not a disease, it is a _natural_ consequence of
_collaboration_. And collaboration is an _absolute necessity_ when
you are above a certain degree of coupling between the components.
A change in the network stack can hardly affect the operation of the
ATA driver - however, this is not the case in GCC [1]. Changes in a
particular phase _do_ affect other phases, and this is not a
coincidence - it is a consequence of the fact that GCC components
are tightly coupled by virtue of working on a common data structure.

The degree of module coupling can be characterized as follows [2]
(from loose to tight):

Degree  Description
------  -----------
0       Independent - no coupling
1       Data coupling - interaction between the modules is with
        simple, unstructured data types, via interface functions.
3       Template coupling - interface function parameters include
        structured data types.
4       Common data - when modules share a common data structure.
5       Control - when one module controls others with flags,
        switches, command codes, etc.

The Linux kernel tends to be in {0, 1, 3}. GCC tends to be in
{4, 5}. IOW, GCC components are roughly from three to five times
more tightly coupled than Linux kernel components.

My point is that the Linux kernel development model [3], while
obviously successful, is not necessarily adequate for other
projects, particularly for GCC.

~velco

[1] AFAICT, I'm not a GCC developer.
[2] There may be other metrics; I've found that one adequate, though YMMV.
[3] And, yes, I claim I fully understand it, at least I fully
understand what _you_ want it to be.

^ permalink raw reply	[flat|nested] 60+ messages in thread
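[Editor's note: the hardlinked-tree workflow Momchil lists above can be exercised with stock GNU coreutils. The sketch below is illustrative only; his `share`/`unshare`/`dmerge` helpers are hypothetical, so "unshare" is emulated here with a copy-and-rename before editing.]

```shell
#!/bin/sh
# Illustrative "poor man's branching" with hardlinked trees, per the
# list above. Requires GNU cp for the -Rl (recursive hardlink) flags.
set -e
work=$(mktemp -d); cd "$work"

mkdir trunk
printf 'int main(void) { return 0; }\n' > trunk/main.c

# "cp -Rl" - branch/tag/fork: a cheap copy sharing storage via hardlinks
cp -Rl trunk branch
[ trunk/main.c -ef branch/main.c ]   # same inode: the "branch" was free

# emulate "unshare" - break the link before editing, so trunk is untouched
cp branch/main.c tmp && mv tmp branch/main.c
printf '/* branch-only change */\n' >> branch/main.c

# "diff -r" - produce a changeset ("patch"/"mail" would move it around)
diff -ru trunk branch > changes.patch || true

# "rm -rf" - transaction rollback
rm -rf branch
```

The `-ef` test is the whole point of the argument: the "branch" costs one directory entry per file rather than a copy of the data, which is exactly the cheap-branch property being debated in this thread.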
* Re: source mgt. requirements solicitation 2002-12-14 16:06 ` Linus Torvalds 2002-12-15 3:59 ` Momchil Velikov @ 2002-12-15 8:26 ` Momchil Velikov 2002-12-15 12:02 ` Linus Torvalds 2002-12-15 17:09 ` Stan Shebs 2 siblings, 1 reply; 60+ messages in thread From: Momchil Velikov @ 2002-12-15 8:26 UTC (permalink / raw) To: Linus Torvalds; +Cc: zack, gcc

>>>>> "Linus" == Linus Torvalds <torvalds@transmeta.com> writes:

Linus> Crap crap crap arguments. Trust me, there are more
Linus> intelligent people out there than you believe, and they can
Linus> do a hell of a lot better work than you currently allow
Linus> them to do. Often with very little formal schooling.

Yes, there are lots of intelligent people out there, but while
intelligence is usually sufficient for working on a kernel, working
on a compiler requires _knowledge_ (whether formal or not).

~velco

^ permalink raw reply	[flat|nested] 60+ messages in thread
* Re: source mgt. requirements solicitation 2002-12-15 8:26 ` Momchil Velikov @ 2002-12-15 12:02 ` Linus Torvalds 2002-12-15 14:16 ` Momchil Velikov 0 siblings, 1 reply; 60+ messages in thread From: Linus Torvalds @ 2002-12-15 12:02 UTC (permalink / raw) To: Momchil Velikov; +Cc: zack, gcc

On 15 Dec 2002, Momchil Velikov wrote:
> >>>>> "Linus" == Linus Torvalds <torvalds@transmeta.com> writes:
>
> Linus> Crap crap crap arguments. Trust me, there are more
> Linus> intelligent people out there than you believe, and they can
> Linus> do a hell of a lot better work than you currently allow
> Linus> them to do. Often with very little formal schooling.
>
> But yes, there are lots of intelligent people out there, but while
> intelligence is usually sufficient for working on a kernel, working on
> a compiler requires _knowledge_ (no matter formal or not).

Blaah. I _bet_ that is not true.

I actually had my own gcc tree for Linux kernel development back
when I started, mostly because I just enjoyed it and found the
compiler interesting. I added builtins for things like memcpy() etc
because I cared (and it was more fun than writing assembly language
library routines), and because gcc at that time had hardly any
support for things like that.

I didn't understand the whole compiler, BUT THAT DID NOT MATTER. The
same way that most Linux kernel developers don't understand the
whole kernel, and do not even need to.

Sure, you need people with specialized knowledge for specialized
areas (designing the big picture etc), but that's a small small part
of it. To paraphrase, programming is often 1% inspiration and 99%
perspiration.

In short, your argument is elitist and simply _wrong_. It's true
that to create a whole compiler you need a whole lot of knowledge,
but that's true of any project - including operating systems. But
that doesn't matter, because there isn't "one" person who needs to
know everything.

		Linus

^ permalink raw reply	[flat|nested] 60+ messages in thread
* Re: source mgt. requirements solicitation 2002-12-15 12:02 ` Linus Torvalds @ 2002-12-15 14:16 ` Momchil Velikov 2002-12-15 15:20 ` Pop Sébastian 0 siblings, 1 reply; 60+ messages in thread From: Momchil Velikov @ 2002-12-15 14:16 UTC (permalink / raw) To: Linus Torvalds; +Cc: zack, gcc

>>>>> "Linus" == Linus Torvalds <torvalds@transmeta.com> writes:

Linus> In short, your argument is elitist and simply _wrong_.

*shrug* That's my explanation for what I observe - more people
develop kernels than compilers. A particular compiler's development
model or patch review and acceptance policy does not matter at all -
if it is an obstacle, people's creativity is redirected somewhere
else.

I may be wrong. But I've yet to hear a more credible explanation for
this simple fact.

~velco

^ permalink raw reply	[flat|nested] 60+ messages in thread
* Re: source mgt. requirements solicitation 2002-12-15 14:16 ` Momchil Velikov @ 2002-12-15 15:20 ` Pop Sébastian 2002-12-15 16:09 ` Linus Torvalds 0 siblings, 1 reply; 60+ messages in thread From: Pop Sébastian @ 2002-12-15 15:20 UTC (permalink / raw) To: Momchil Velikov; +Cc: Linus Torvalds, zack, gcc

On Sun, Dec 15, 2002 at 11:41:04PM +0200, Momchil Velikov wrote:
> >>>>> "Linus" == Linus Torvalds <torvalds@transmeta.com> writes:
>
> Linus> In short, your argument is elitist and simply _wrong_.
>
> *shrug* That's my explanation to what I observe - more people develop
> kernels than compilers.
>
> I may be wrong. But I'm yet to hear a more credible explanation for
> this simple fact.
>
Maybe it's true because for writing compiler optimizations one
should have some knowledge of mathematics. Most of the new
techniques developed for optimizing compilers use abstract
representations based on mathematical objects (such as graphs,
lattices, vector spaces, polyhedra, ...)

Maybe we're wrong, but the percentage of mathematicians who
contribute to GCC could be slightly bigger than for LK.

Sebastian

^ permalink raw reply	[flat|nested] 60+ messages in thread
* Re: source mgt. requirements solicitation 2002-12-15 15:20 ` Pop Sébastian @ 2002-12-15 16:09 ` Linus Torvalds 2002-12-15 16:49 ` Bruce Stephens 2002-12-16 17:22 ` source mgt. requirements solicitation Mike Stump 0 siblings, 2 replies; 60+ messages in thread From: Linus Torvalds @ 2002-12-15 16:09 UTC (permalink / raw) To: Pop Sébastian; +Cc: Momchil Velikov, zack, gcc

On Sun, 15 Dec 2002, Pop Sébastian wrote:
> >
> > I may be wrong. But I'm yet to hear a more credible explanation for
> > this simple fact.
> >
> Maybe it's true because for writing compiler optimizations one
> should have some knowledge in mathematics.

Naah. It's simple - kernels are just sexier.

Seriously, I think it's just that a kernel tends to have more
different _kinds_ of problems, and thus tends to attract different
kinds of people, and more of them. Compilers are complicated, no
doubt about that, but the complicated stuff tends to be mostly of
the same type (ie largely fairly algorithmic transformations for the
different optimization passes).

In kernels, you have many _different_ kinds of issues, and as a
result you'll find more people who are interested in one of them. So
you'll find people who care about filesystems, or people who care
about memory management, or people who find it interesting to do
concurrency work or IO paths.

That is obviously also why the kernel ends up being a lot of lines
of code. I think it's about an order of magnitude bigger in size
than all of gcc - not because it is an order of magnitude more
complex, obviously, but simply because it has many more parts to it.
And that directly translates to more pieces that people can cut
their teeth on.

> Maybe we're wrong but the percentage of mathematicians who contribute
> to GCC could be slightly bigger than for LK.

I don't think you're wrong per se. The "transformation" kind of code
is just much more common in a compiler, and the kind of people who
work on it are more likely to be the mathematical kind of people.
It's not the only part of gcc, obviously (I think parsing is
underrated, and I'm happy that the preprocessing front-end has
gotten so much attention in the last few years), but it's one of the
bigger parts. And people clearly seek out projects that satisfy
their interests.

		Linus

^ permalink raw reply	[flat|nested] 60+ messages in thread
* Re: source mgt. requirements solicitation 2002-12-15 16:09 ` Linus Torvalds @ 2002-12-15 16:49 ` Bruce Stephens 2002-12-15 16:59 ` Linus Torvalds 2002-12-16 17:22 ` source mgt. requirements solicitation Mike Stump 1 sibling, 1 reply; 60+ messages in thread From: Bruce Stephens @ 2002-12-15 16:49 UTC (permalink / raw) To: Linus Torvalds; +Cc: Pop Sébastian, Momchil Velikov, zack, gcc

Linus Torvalds <torvalds@transmeta.com> writes:

[...]

> That is obviously also why the kernel ends up being a lot of lines of
> code. I think it's about an order of magnitude bigger in size than all of
> gcc - not because it is an order of magnitude more complex, obviously, but
> simply because it has many more parts to it. And that directly translates
> to more pieces that people can cut their teeth on.

The gcc tree I have seems to have 4145483 lines, whereas the 2.4.20
kernel seems to have 4841227 lines. (Not lines of code; this
includes all files in the unbuilt tree (including CVS directories
for gcc, although this is probably trivial), and it includes
comments and files which are not code. In the gcc case, it may
include some generated files; I'm not sure how Ada builds nowadays.)
Excluding the gcc testsuites, gcc has 3848080 lines.

So gcc (the whole of gcc, with all its languages) seems to be a bit
smaller than the kernel, but probably not by an order of magnitude.
This is reinforced by "du -s": the gcc tree takes up 187144K, the
kernel takes up 170676K.

None of this is particularly precise, obviously, but it points to
the two projects (with all their combined bits) being not too
dissimilar in size. Which is a possibly interesting coincidence.
(The 2.5 kernel may be much bigger; I haven't looked. The tarballs
don't look *that* much bigger, however.)

[...]

^ permalink raw reply	[flat|nested] 60+ messages in thread
* Re: source mgt. requirements solicitation 2002-12-15 16:49 ` Bruce Stephens @ 2002-12-15 16:59 ` Linus Torvalds 2002-12-15 18:10 ` Bruce Stephens 2002-12-16 8:32 ` Diego Novillo 0 siblings, 2 replies; 60+ messages in thread From: Linus Torvalds @ 2002-12-15 16:59 UTC (permalink / raw) To: Bruce Stephens; +Cc: Pop Sébastian, Momchil Velikov, zack, gcc

On Mon, 16 Dec 2002, Bruce Stephens wrote:
>
> The gcc tree I have seems to have 4145483 lines

Hmm, might be my mistake. I only have an old and possibly pared-down
tree online. However, I also counted lines differently: I only
counted *.[chS] files, and you may have counted everything (the gcc
changelogs and .texi files in particular are _huge_ if you have a
full complement of them there).

What does "find . -name '*.[chS]' | xargs cat | wc" say?

(But you're right - I should _at_least_ count the .md files too, so
my count was at least as bogus as I suspect yours was)

		Linus

^ permalink raw reply	[flat|nested] 60+ messages in thread
* Re: source mgt. requirements solicitation 2002-12-15 16:59 ` Linus Torvalds @ 2002-12-15 18:10 ` Bruce Stephens 2002-12-16 8:32 ` Diego Novillo 1 sibling, 0 replies; 60+ messages in thread From: Bruce Stephens @ 2002-12-15 18:10 UTC (permalink / raw) To: Linus Torvalds; +Cc: Pop Sébastian, Momchil Velikov, zack, gcc

Linus Torvalds <torvalds@transmeta.com> writes:

> On Mon, 16 Dec 2002, Bruce Stephens wrote:
>>
>> The gcc tree I have seems to have 4145483 lines
>
> Hmm, might be my mistake. I only have an old and possibly pared-down tree
> online. However, I also counted lines differently: I only counted *.[chS]
> files, and you may have counted everything (the gcc changelogs and .texi
> files in particular are _huge_ if you have a full complement of them
> there).

The ChangeLog files give a total of 306201 lines. texi files (and
info files) add another 426482. So that's a lot, yes. (In terms of
the size of the project, probably the texi files at least ought to
be counted, just as the stuff in Documentation ought to be counted
in some way for the Linux kernel. But not the generated .info
files.)

> What does "find . -name '*.[chS]' | xargs cat | wc" say?

1445809 5700810 43690421

But that doesn't include the ada or java files (or the C++ standard
library). Quite possibly it doesn't include some Objective C runtime
source, too.

> (But you're right - I should _at_least_ count the .md files too, so
> my count was at least as bogus as I suspect yours was)

Sure. It's all pretty meaningless---I think the two projects happen
to be approximately the same size (with the Linux kernel bigger),
but I don't think it's anything other than coincidence. gcc/ada
accounts for about 800K lines, for example, and that's relatively
recent, IIRC.

^ permalink raw reply	[flat|nested] 60+ messages in thread
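[Editor's note: the ad-hoc counts traded back and forth above are easy to skew (CVS metadata, generated files, filenames with spaces). The sketch below is a slightly more careful variant, assuming GNU findutils; the extension list is an illustrative guess, not what either poster actually counted.]

```shell
#!/bin/sh
# Rough line counter in the spirit of the find|xargs|wc one-liners
# above: prune CVS directories and count C/header/asm sources plus
# gcc-style .md machine descriptions. Requires GNU find/xargs for
# -print0/-0, which keeps odd filenames from splitting.
count_lines() {
    find "$1" -name CVS -prune -o \
         -type f \( -name '*.[chS]' -o -name '*.cc' -o -name '*.md' \) \
         -print0 | xargs -0 cat | wc -l
}

# tiny demo tree with known contents
demo=$(mktemp -d)
mkdir -p "$demo/CVS"
printf 'line1\nline2\nline3\n' > "$demo/a.c"
printf 'one\ntwo\n'            > "$demo/machine.md"
printf 'not counted\n'         > "$demo/CVS/junk.c"
printf 'not counted\n'         > "$demo/README"
count_lines "$demo"
```

Piping everything through one `cat | wc -l` (instead of `wc` per file) avoids the per-file totals lines that inflate naive counts.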
* Re: source mgt. requirements solicitation 2002-12-15 16:59 ` Linus Torvalds 2002-12-15 18:10 ` Bruce Stephens @ 2002-12-16 8:32 ` Diego Novillo 2002-12-17 3:36 ` Pop Sébastian 1 sibling, 1 reply; 60+ messages in thread From: Diego Novillo @ 2002-12-16 8:32 UTC (permalink / raw) To: Linus Torvalds Cc: Bruce Stephens, Pop Sébastian, Momchil Velikov, zack, gcc

On Sun, 15 Dec 2002, Linus Torvalds wrote:

> On Mon, 16 Dec 2002, Bruce Stephens wrote:
> >
> > The gcc tree I have seems to have 4145483 lines
>
> Hmm, might be my mistake. I only have an old and possibly pared-down tree
> online. However, I also counted lines differently: I only counted *.[chS]
> files, and you may have counted everything (the gcc changelogs and .texi
> files in particular are _huge_ if you have a full complement of them
> there).
>
Output of sloccount on a relatively recent snapshot:

-----------------------------------------------------------------------------
SLOC     Directory           SLOC-by-Language (Sorted)
1274221  gcc                 ansic=839349,ada=298101,cpp=73596,yacc=23251,
                             asm=20244,fortran=6934,exp=4706,sh=4430,
                             objc=2751,lex=559,perl=189,awk=111
225571   libjava             java=131300,cpp=65054,ansic=27198,exp=1213,
                             perl=782,awk=24
67452    libstdc++-v3        cpp=49425,ansic=17270,sh=525,exp=193,awk=39
34729    boehm-gc            ansic=25682,sh=7631,cpp=972,asm=444
21798    libiberty           ansic=21495,perl=283,sed=20
11657    top_dir             sh=11657
10376    libbanshee          ansic=10376
10358    libf2c              ansic=10037,fortran=321
9581     zlib                ansic=8309,asm=712,cpp=560
8904     libffi              ansic=5545,asm=3359
8002     libobjc             ansic=7233,objc=397,cpp=372
3721     contrib             cpp=2306,sh=935,perl=324,awk=67,lisp=59,ansic=30
3074     libmudflap          ansic=3074
2506     fastjar             ansic=2325,sh=181
1463     include             ansic=1432,cpp=31
667      maintainer-scripts  sh=667
0        config              (none)
0        CVS                 (none)
0        INSTALL             (none)

Totals grouped by language (dominant language first):
ansic:    979355 (57.81%)
ada:      298101 (17.60%)
cpp:      192316 (11.35%)
java:     131300 (7.75%)
sh:        26026 (1.54%)
asm:       24759 (1.46%)
yacc:      23251 (1.37%)
fortran:    7255 (0.43%)
exp:        6112 (0.36%)
objc:       3148 (0.19%)
perl:       1578 (0.09%)
lex:         559 (0.03%)
awk:         241 (0.01%)
lisp:         59 (0.00%)
sed:          20 (0.00%)

Total Physical Source Lines of Code (SLOC)                = 1,694,080
Development Effort Estimate, Person-Years (Person-Months) = 491.37 (5,896.47)
 (Basic COCOMO model, Person-Months = 2.4 * (KSLOC**1.05))
Schedule Estimate, Years (Months)                         = 5.64 (67.73)
 (Basic COCOMO model, Months = 2.5 * (person-months**0.38))
Estimated Average Number of Developers (Effort/Schedule)  = 87.06
Total Estimated Cost to Develop                           = $ 66,377,705
 (average salary = $56,286/year, overhead = 2.40).
SLOCCount is Open Source Software/Free Software, licensed under the FSF GPL.
Please credit this data as "generated using 'SLOCCount' by David A. Wheeler."
-----------------------------------------------------------------------------

Diego.

^ permalink raw reply	[flat|nested] 60+ messages in thread
* Re: source mgt. requirements solicitation 2002-12-16 8:32 ` Diego Novillo @ 2002-12-17 3:36 ` Pop Sébastian 2002-12-17 13:14 ` Tom Lord 0 siblings, 1 reply; 60+ messages in thread From: Pop Sébastian @ 2002-12-17 3:36 UTC (permalink / raw) To: Diego Novillo; +Cc: Linus Torvalds, Bruce Stephens, Momchil Velikov, zack, gcc

For comparison I've run sloccount on LK:

$ sloccount ./linux-2.5.52
[...]
SLOC     Directory       SLOC-by-Language (Sorted)
1664092  drivers         ansic=1659643,asm=1949,yacc=1177,perl=813,
                         lex=352,sh=158
678895   arch            ansic=507796,asm=170311,sh=624,awk=119,perl=45
365490   include         ansic=364696,cpp=794
340797   fs              ansic=340797
261122   sound           ansic=260940,asm=182
193052   net             ansic=193052
14814    kernel          ansic=14814
13523    mm              ansic=13523
11086    scripts         ansic=6830,perl=1339,cpp=1218,yacc=531,tcl=509,
                         lex=359,sh=285,awk=15
6988     crypto          ansic=6988
6083     lib             ansic=6083
2740     ipc             ansic=2740
1787     init            ansic=1787
1748     Documentation   sh=898,ansic=567,lisp=218,perl=65
1081     security        ansic=1081
119      usr             ansic=119
0        top_dir         (none)

Totals grouped by language (dominant language first):
ansic:   3381456 (94.89%)
asm:      172442 (4.84%)
perl:       2262 (0.06%)
cpp:        2012 (0.06%)
sh:         1965 (0.06%)
yacc:       1708 (0.05%)
lex:         711 (0.02%)
tcl:         509 (0.01%)
lisp:        218 (0.01%)
awk:         134 (0.00%)

Total Physical Source Lines of Code (SLOC)                = 3,563,417
Development Effort Estimate, Person-Years (Person-Months) = 1,072.73 (12,872.75)
 (Basic COCOMO model, Person-Months = 2.4 * (KSLOC**1.05))
Schedule Estimate, Years (Months)                         = 7.59 (91.12)
 (Basic COCOMO model, Months = 2.5 * (person-months**0.38))
Estimated Average Number of Developers (Effort/Schedule)  = 141.27
Total Estimated Cost to Develop                           = $ 144,911,083
 (average salary = $56,286/year, overhead = 2.40).
SLOCCount is Open Source Software/Free Software, licensed under the FSF GPL.
Please credit this data as "generated using 'SLOCCount' by David A. Wheeler."

^ permalink raw reply	[flat|nested] 60+ messages in thread
* Re: source mgt. requirements solicitation 2002-12-17 3:36 ` Pop Sébastian @ 2002-12-17 13:14 ` Tom Lord 2002-12-17 15:28 ` Itching and scratching (Re: source mgt. requirements solicitation) Stan Shebs 0 siblings, 1 reply; 60+ messages in thread From: Tom Lord @ 2002-12-17 13:14 UTC (permalink / raw) To: pop; +Cc: dnovillo, torvalds, bruce, velco, zack, gcc

GCC: Total Estimated Cost to Develop = $ 66,377,705
LK:  Total Estimated Cost to Develop = $ 144,911,083

and: (average salary = $56,286/year, overhead = 2.40).

(That's an appallingly low average salary, btw., and a needlessly
large overhead. If we're thinking of a nearly 20 year average, maybe
it's not _too_ badly removed from reality, but it's not a realistic
basis for planning moving forward.)

Someone did a sloccount run on a bunch of my 1-man-effort software,
developed over about 10 calendar years, and the person-years count
was surprisingly accurate.

In general, there is something of a business crisis in the free
software world. It's particularly noticeable around businesses based
on linux distributions. Those distributions represent a huge amount
of unpaid work. Businesses using them got some free help
bootstrapping themselves into now favorable positions. So, not only
did they get the unpaid work for free (as in beer), they traded that
for favorable market positions that raise the barrier of entry to
new competitors. While in theory "anyone" can start selling their
own distro, in reality, there's only a few established companies and
investors with deep pockets who have any chance in this area.

So what's the crisis? Well, those freeloaders aren't exactly being
aggressive about figuring out how to sustain the free software
movement with R&D investment.
Companies spend a little on public projects, sure, but the number of
employees participating, industry-wide, amounts to a few 10s of
people, and (total, industry-wide) corporate donations to
code-generating individuals and NPOs come to no more than 7
significant digits per year.

When they do spend on public projects, it is most often for very
narrow tactical purposes - not to make the ecology of projects
healthier overall. In significant proportions, they spend R&D money
on entirely in-house projects that, while rooted in free software,
benefit nobody but the companies themselves. You know, it's easy to
make a few quarters for your business unit when you, in essence,
cheat.

So the crisis is that in the medium term, as engineering businesses
go, these aren't sustainable models. And when they start leading
volunteers and soaking up volunteer work for their own aims, and
capturing mind-share in the press, one has to start to wonder
whether they aren't, overall, doing more harm than good. And then
there's some social justice and labor issues....

Bill Gates, when he says that free software is a threat to
innovation, is currently correct. UnAmerican? You bet!

And, btw, surprise!: In the free software world, corporate GCC
hackers are the relative fat cats. Go figure.

-t

^ permalink raw reply	[flat|nested] 60+ messages in thread
* Itching and scratching (Re: source mgt. requirements solicitation) 2002-12-17 13:14 ` Tom Lord @ 2002-12-17 15:28 ` Stan Shebs 2002-12-17 16:07 ` Tom Lord 0 siblings, 1 reply; 60+ messages in thread From: Stan Shebs @ 2002-12-17 15:28 UTC (permalink / raw) To: Tom Lord; +Cc: gcc

Tom Lord wrote:
>
> And, btw, surprise!: In the free software world, corporate GCC hackers
> are the relative fat cats. Go figure.
>
That's because GCC hackers are doing things that are worth serious
amounts of money to people that have it to spend. Apple has signed
up with GCC because it solves more of Apple's problems more cheaply
than the several proprietary possibilities, and having made it part
of Mac OS X, Apple's overall corporate health is now partly
dependent on GCC continuing to be a good compiler, and on fixing
remaining problems, such as slowness.

If you were able to convince Apple mgmt that you could make GCC 10x
faster not using precompiled headers, I think you could name your
price and get hired the same day; that's how important the problem
is to Apple. (You're going to have to be really convincing though;
our mgmt has listened to a hundred pitches already.)

Speaking more generally, the folks that get paid to do free software
are the ones who are solving the problems of people with the money.
It's up to us to be clever enough to figure out how to solve the
specific problems in a way that improves architecture and
infrastructure. That was a key but underappreciated aspect of
Cygnus' development contracts; we would always try to go after
projects that included infrastructure improvement, but if necessary
we would do something that was random but lucrative and use the
profits to pay for generic work.

To put it more simply, find a rich person with an itch, and offer to
scratch it for them. :-)

Stan

^ permalink raw reply	[flat|nested] 60+ messages in thread
* Re: Itching and scratching (Re: source mgt. requirements solicitation) 2002-12-17 15:28 ` Itching and scratching (Re: source mgt. requirements solicitation) Stan Shebs @ 2002-12-17 16:07 ` Tom Lord 2002-12-17 15:46 ` Stan Shebs 0 siblings, 1 reply; 60+ messages in thread From: Tom Lord @ 2002-12-17 16:07 UTC (permalink / raw) To: shebs; +Cc: gcc

	> And, btw, surprise!: In the free software world, corporate
	> GCC hackers are the relative fat cats. Go figure.

	That's because GCC hackers are doing things that are worth
	serious amounts of money to people that have it to spend.

I didn't mean that it is wrong for you to be well paid. I meant that
you have a lot of clout.

	Speaking more generally, the folks that get paid to do free
	software are the ones who are solving the problems of people
	with the money. It's up to us to be clever enough to figure
	out to solve the specific problems in a way that improves
	architecture and infrastructure. That was a key but
	underappreciated aspect of Cygnus' development contracts; we
	would always try to go after projects that included
	infrastructure improvement, but if necessary we would do
	something that was random but lucrative and use the profits
	to pay for generic work.

Was it customers who underappreciated that? or was that a selling
point?

	To put it more simply, find a rich person with an itch, and
	offer to scratch it for them. :-)

Ah, well, "The Cabots speak only to Lodges and the Lodges speak only
to God."

-t

^ permalink raw reply	[flat|nested] 60+ messages in thread
* Re: Itching and scratching (Re: source mgt. requirements solicitation) 2002-12-17 16:07 ` Tom Lord @ 2002-12-17 15:46 ` Stan Shebs 0 siblings, 0 replies; 60+ messages in thread From: Stan Shebs @ 2002-12-17 15:46 UTC (permalink / raw) To: Tom Lord; +Cc: gcc

Tom Lord wrote:
>
> Speaking more generally, the folks that get paid to do free
> software are the ones who are solving the problems of people
> with the money. It's up to us to be clever enough to figure
> out to solve the specific problems in a way that improves
> architecture and infrastructure. That was a key but
> underappreciated aspect of Cygnus' development contracts; we
> would always try to go after projects that included
> infrastructure improvement, but if necessary we would do
> something that was random but lucrative and use the profits to
> pay for generic work.
>
> Was it customers who underappreciated that? or was that a selling
> point?
>
Sometimes it was a selling point, sometimes the concept was too
subtle for the customer to grasp. In the mid-90s, a good percentage
of time still had to be spent explaining free software, reassuring
people that GCC didn't cause its output to be GPLed, etc. It was
interesting to see how much variation there was among customers, and
also how important it was to have actual sales people in the process
- engineers left to themselves would rathole on side issues and
never get around to the actual dealmaking.

Stan

^ permalink raw reply	[flat|nested] 60+ messages in thread
* Re: source mgt. requirements solicitation 2002-12-15 16:09 ` Linus Torvalds 2002-12-15 16:49 ` Bruce Stephens @ 2002-12-16 17:22 ` Mike Stump 1 sibling, 0 replies; 60+ messages in thread From: Mike Stump @ 2002-12-16 17:22 UTC (permalink / raw) To: Linus Torvalds; +Cc: Pop Sébastian, Momchil Velikov, zack, gcc

On Sunday, December 15, 2002, at 03:45 PM, Linus Torvalds wrote:
> That is obviously also why the kernel ends up being a lot of lines of
> code. I think it's about an order of magnitude bigger in size than all
> of gcc

bash-2.05a$ find gcc -type f -print | xargs cat | wc -l
 4084979

[ ducking ]

^ permalink raw reply	[flat|nested] 60+ messages in thread
* Re: source mgt. requirements solicitation 2002-12-14 16:06 ` Linus Torvalds 2002-12-15 3:59 ` Momchil Velikov 2002-12-15 8:26 ` Momchil Velikov @ 2002-12-15 17:09 ` Stan Shebs 2 siblings, 0 replies; 60+ messages in thread From: Stan Shebs @ 2002-12-15 17:09 UTC (permalink / raw) To: Linus Torvalds; +Cc: Momchil Velikov, zack, gcc Linus Torvalds wrote: >[...] > > If gcc had less of a restrictive model >for accepting patches, you'd have a lot more random people who would do >them, I bet. > I can assure you that there are lots of random GCC patches and forks out there, some of them drastically divergent from the main version. (I myself have been responsible for a few of them.) Nobody is being stopped from forking GCC and promoting their own versions. A large number of GCC developers have chosen to cooperate more closely on a single tree because we've empirically determined that we get a better quality compiler that way. Choice of source management systems is a minor detail, not a make-or-break issue. >But gcc development not only has the "CVS mentality", it has >the "FSF disease" with the paperwork crap and copyright assignment crap. > If AT&T had come down on GNU in the 80s the way that they did on BSD in the early 90s, you wouldn't have had any software to go with your kernel. RMS is much smarter than you seem to think. Stan ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: source mgt. requirements solicitation 2002-12-09 17:05 ` Walter Landry 2002-12-09 17:10 ` Joseph S. Myers @ 2002-12-09 17:50 ` Zack Weinberg 1 sibling, 0 replies; 60+ messages in thread From: Zack Weinberg @ 2002-12-09 17:50 UTC (permalink / raw) To: gcc I'm keeping around a lot of context. Scroll down. >> > > 0d. The data stored in the repository cannot be modified by >> > > unprivileged local users except by going through the version >> > > control system. Presently I could take 'vi' to one of the ,v >> > > files in /cvs/gcc and break it thoroughly, or sneak something into >> > > the file content, and leave no trace. >> > >> > There is no interaction with root, so if you own the archive, you can >> > always do what you want. To get anything approaching this, you have >> > to deal with PGP signatures, SHA hashes, and the like. OpenCM is >> > probably the only group (including BitKeeper) that even comes close to >> > doing this right. >> >> This sort of thing has been done simply by a modified setuid (to a cvs >> user, not root) cvs binary so users can't access the repository directly, >> only through that binary. More generically, with a reasonable protocol >> for local repository access it should be possible to use GNU userv to >> separate the repository from the users. > > This is a different security model. Arch is secure because it doesn't > depend on having privileged access. For example, there is an "rm > -rf" command built into arch. > > I have a feeling that you are thinking of how CVS handles things, with > a centralized server. Part of the whole point of arch is that there > is no centralized server. So, for example, I can develop arch > independently of whether Tom thinks that I am worthy enough to do so. > I can screw up my archive as much as I want (and I have), and Tom can > be blissfully unaware. Easy merging is what makes this possible. > > So you don't, in general, have a repository that is writeable by more > than one person. 
Let me be specific about the problem I'm worried about. As Joseph pointed out, GCC development is and will be centered around a 'master' server. If we wind up using a distributed system, individual developers will take advantage of it to do offline work, but the master repository will still act as a communication nexus between us all, and official releases will be cut from there. I doubt anyone will do releases except from there.[1] The security of this master server is mission-critical. The present situation, with CVS pserver enabled for read-only anonymous access, and write privilege available via CVS-over-ssh, has two potentially exploitable vulnerabilities that should be easy to address in a new system. _Imprimis_, the CVS pserver requires write privileges on the CVS repository directories, even if it is providing only read access. Therefore, if the 'anoncvs' user is somehow compromised -- for instance, by a buffer overflow bug in the pserver itself -- the attacker could potentially modify any of the ,v files stored in the repository. This was what I was talking about with my point 0c. It sounds like all the replacements for CVS have addressed this, by allowing the anoncvs-equivalent server process to run as a user that doesn't have OS-level write privileges on the repository. _Secundus_, CVS-over-ssh operates by invoking 'cvs server' on the repository host -- running under the user ID of the invoker, who must have an account on the repository host. It can't perform any operations that the invoking user can't. Which means that the invoking user must also have OS-level write privileges on the repository. Now, such users are _supposed_ to be able to check in changes to the repository, but they _aren't_ supposed to be able to modify the ,v files with a text editor. The distinction is crucial. 
If the account of a user with write privileges is compromised, and used to check in a malicious change, the version history is intact, the change will be easily detected, and we can simply back out the malice. If the account of a user with write privileges is compromised and used to hand-edit a malicious change into a ,v file, it's quite possible that this will go undetected until after half the binaries on the planet are untrustworthy. It is this latter scenario I would like to be impossible. There are several possible ways to do that. One way is the way Perforce does it: _all_ access, even local access, goes through p4d, and p4d can run under its own user ID and be the only user ID with write access to the repository. Another way, and perhaps a cleverer one, is OpenCM's way, where the SHA of the file content is the file's identity, so a malicious change will not even be picked up. (Please correct me if I misunderstand.) Of course, that provides no insulation against an attacker using a compromised account to execute "rm -fr /path/to/repository", but *that* problem is best solved with backups, because a disk failure could have the same effect and there's nothing software can do about that. zw ^ permalink raw reply [flat|nested] 60+ messages in thread
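The content-addressed scheme attributed to OpenCM above can be sketched in a few lines. This is a hypothetical toy store (a dict standing in for the repository, not OpenCM's actual on-disk format): a blob's name is the SHA-1 of its content, so a hand-edited blob simply stops matching its own name.

```python
import hashlib

def put(store, data):
    """Store a blob under the SHA-1 of its content; the key IS the identity."""
    key = hashlib.sha1(data).hexdigest()
    store[key] = data
    return key

def verify(store, key):
    """A blob is authentic iff its content still hashes to its name."""
    return hashlib.sha1(store[key]).hexdigest() == key

store = {}
key = put(store, b"int main(void) { return 0; }\n")
assert verify(store, key)       # checked-in content matches its identity

# A malicious hand edit behind the tool's back is immediately detectable:
store[key] = b"int main(void) { evil(); return 0; }\n"
assert not verify(store, key)   # the blob no longer matches its own name
```

As noted above, this makes silent substitution detectable but offers no protection against outright deletion; backups still cover that case.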
* Re: source mgt. requirements solicitation 2002-12-09 15:27 ` Joseph S. Myers 2002-12-09 17:05 ` Walter Landry @ 2002-12-11 1:11 ` Branko Čibej 1 sibling, 0 replies; 60+ messages in thread From: Branko Čibej @ 2002-12-11 1:11 UTC (permalink / raw) To: Joseph S. Myers; +Cc: gcc Joseph S. Myers wrote: >On Mon, 9 Dec 2002, Walter Landry wrote: > > > >>arch doesn't interact at all with root. The remote repositories are >>all done with sftp, ftp, and http, which is as secure as those servers >>are. >> >> > >Is this - for anonymous access - _plain_ HTTP, or HTTP + WebDAV + DeltaV >which svn uses? One problem there was with SVN - it may have been fixed >by now, and a fix would be necessary for it to be usable for GCC - was its >use of HTTP and HTTPS (for write access); these tend to be heavily >controlled by firewalls and the ability to tunnel over SSH (with just that >one port needing to be open) would be necessary. "Transparent" proxies >may pass plain HTTP OK, but not the WebDAV/DeltaV extensions SVN needs. > There is now a new repository access layer in place that can be easily piped over SSH and doesn't require Apache on the server side. It's not as well tested yet, of course. -- Brane Čibej <brane@xbc.nu> http://www.xbc.nu/brane/ ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: source mgt. requirements solicitation 2002-12-08 14:18 ` source mgt. requirements solicitation Tom Lord 2002-12-08 14:56 ` DJ Delorie 2002-12-08 16:09 ` Phil Edwards @ 2002-12-08 18:32 ` Joseph S. Myers 2002-12-11 2:48 ` Branko Čibej 2 siblings, 1 reply; 60+ messages in thread From: Joseph S. Myers @ 2002-12-08 18:32 UTC (permalink / raw) To: Tom Lord; +Cc: gcc On Sun, 8 Dec 2002, Tom Lord wrote: > 1) There are frequent reports on this list of glitches with > the current CVS repository. The most common problem relates to the fileattr performance optimization. There are known causes, and a known workaround (remove the cache files when the problem occurs). Other problems (occasional repository corruption) may often relate to hardware problems. BK uses extensive checksumming to detect such failures early (since early detection means backups can more easily be found); the RCS format has no checksums. I don't know what svn or arch do here. There are particular issues that are relevant to GCC (and other CVS users) that SVN addresses or intends to address as a "better CVS": * Proper file renaming support. * Atomic checkins across multiple files (rarely a problem). * O(1) performance of tag and branch operations. (A major issue for the snapshot script; when the machine is loaded it can take hours to tag the tree with the per-snapshot tag, remove the old gcc_latest snapshot tag and apply the new one (writing to every ,v file several times). Part of the problem, however, is waiting on locks in each directory, and reducing the extent to which locks are needed (e.g. avoiding them for anonymous checkouts) and the time for which they are held would help.) * Performance of operations (checkout, update, ...) on branches (reading every file in the tree; the cache mentioned above avoids this problem for HEAD only). * cvs update -d and modules (more an issue with merged gcc and src trees) (I don't know whether svn does modules yet). 
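The checksumming the RCS ,v format lacks can be approximated externally. A minimal sketch (plain Python with invented paths, not anything BK or CVS actually ships): keep a digest manifest for every ,v file and recheck it periodically, so silent corruption from a failing disk or a hand edit is caught early enough that backups still exist.

```python
import hashlib
import os
import tempfile

def manifest(repo):
    """Map each ,v file under repo (by relative path) to a SHA-256 digest."""
    digests = {}
    for root, _, files in os.walk(repo):
        for name in files:
            if name.endswith(",v"):
                path = os.path.join(root, name)
                with open(path, "rb") as f:
                    rel = os.path.relpath(path, repo)
                    digests[rel] = hashlib.sha256(f.read()).hexdigest()
    return digests

def damaged(repo, saved):
    """Return files whose current digest no longer matches the manifest."""
    current = manifest(repo)
    return sorted(p for p in saved if current.get(p) != saved[p])

repo = tempfile.mkdtemp()
with open(os.path.join(repo, "foo.c,v"), "w") as f:
    f.write("head 1.1;\n")
saved = manifest(repo)
assert damaged(repo, saved) == []            # repository intact

# Simulate silent corruption (failing disk, or a hand edit with vi):
with open(os.path.join(repo, "foo.c,v"), "a") as f:
    f.write("garbage\n")
assert damaged(repo, saved) == ["foo.c,v"]   # caught by the manifest
```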
I haven't seen an obvious need for major changes in branch merging or distributed repositories, but people making heavy use of branches may well have a use for better tools there. It's just that something (a) supporting file renames and (b) having much better performance (including on branches) and (c) having better reliability would solve most of the problems for most of the users. (Not all problems for all users, better tools aiming towards that are still useful if they don't cause more trouble in the common case. Checkout, update, diff, annotate, commit shouldn't be made any more complicated.) > 2) GCC, more than many projects, relies on a distributed > testing effort, which mostly applies to the HEAD revision > and to release candidates. Most of this testing is done > by hand. Better tools are useful here (I always want more testing and more testcases) but it isn't much to do with version control, rather with processing the test results into a coherent form (there used to be a database driven from gcc-testresults) and getting people to fix regressions they cause (not a problem lately, but there have been long periods with the regression tester showing regressions staying for weeks). > 7) Questions about which patches relate to which issues in the > issue database are fairly common. Better tools may help if they encourage volunteers to do the boring task of going through incoming bug reports and checking they include enough information to reproduce them and can be reproduced. But that's a matter of the long-delayed Bugzilla transition (delayed by human time to set up a new machine, not by lack of better version control) possibly linked with some system for bug reports to have enough well-defined fields for automatic testing. > 9) Distributed testing occurs mostly on the HEAD -- which > means that the HEAD breaks on various targets, fairly > frequently. It means that HEAD breakage is frequently detected. 
> 11) Some efforts, such as overhauling the build process, will > probably benefit from a switch to rev ctl. systems that > support tree rearrangements. I think it's better to just do renames the CVS way (delete and add) now, rather than waiting; then, when we do change systems, make the repository conversion tool smart enough to handle most of the renames that have taken place in the GCC repository. Better tools such as svn or arch may be useful, but we're not CM developers so it's just a matter of evaluating such tools when they are ready (do all the common things CVS does just as easily, are reliable enough, have good enough (preferably better than CVS) performance for what we do, solve some of the problems with CVS). Indications (such as above) of problems with CVS for GCC aren't particularly important, since the main problems with CVS are well known and affect GCC much as they affect other projects. -- Joseph S. Myers jsm28@cam.ac.uk ^ permalink raw reply [flat|nested] 60+ messages in thread
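The conversion-tool idea above — recover most of the delete-and-add renames after the fact — reduces, in the simplest case, to pairing deleted and added files by content digest. A hypothetical sketch (file names and contents invented; real converters also score partial similarity rather than requiring exact matches):

```python
import hashlib

def digest(data):
    return hashlib.sha1(data).hexdigest()

# CVS history records these only as independent delete and add events:
deleted = {"frontend.c": digest(b"int parse(void);\n")}
added = {
    "gcc/frontend.c": digest(b"int parse(void);\n"),
    "gcc/brand-new.c": digest(b"/* new file */\n"),
}

# Pair old and new paths with identical content to recover the renames:
renames = {old: new
           for old, h in deleted.items()
           for new, new_h in added.items()
           if h == new_h}
assert renames == {"frontend.c": "gcc/frontend.c"}
```

Genuinely new files (no matching deletion) are left alone, which is the behavior a conversion tool would want.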
* Re: source mgt. requirements solicitation 2002-12-08 18:32 ` Joseph S. Myers @ 2002-12-11 2:48 ` Branko Čibej 0 siblings, 0 replies; 60+ messages in thread From: Branko Čibej @ 2002-12-11 2:48 UTC (permalink / raw) To: Joseph S. Myers; +Cc: gcc Joseph S. Myers wrote: >* cvs update -d and modules (more an issue with merged gcc and src trees) >(I don't know whether svn does modules yet). > > Subversion does modules a lot better than CVS, if I do say so myself. See http://svnbook.red-bean.com/book.html#svn-ch-6-sect-3 -- Brane Čibej <brane@xbc.nu> http://www.xbc.nu/brane/ ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: Itching and scratching (Re: source mgt. requirements solicitation)
@ 2002-12-18 21:44 Robert Dewar
0 siblings, 0 replies; 60+ messages in thread
From: Robert Dewar @ 2002-12-18 21:44 UTC (permalink / raw)
To: lord, shebs; +Cc: gcc
> Speaking more generally, the folks that get paid to do free software
> are the ones who are solving the problems of people with the money.
> It's up to us to be clever enough to figure out to solve the specific
> problems in a way that improves architecture and infrastructure.
> That was a key but underappreciated aspect of Cygnus' development
> contracts; we would always try to go after projects that included
> infrastructure improvement, but if necessary we would do something
> that was random but lucrative and use the profits to pay for
> generic work.
For the record, this is very similar to ACT's approach to development
contracts.
^ permalink raw reply [flat|nested] 60+ messages in thread
end of thread, other threads:[~2002-12-19 2:39 UTC | newest] Thread overview: 60+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2002-12-08 7:13 on reputation and lines and putting things places (Re: gcc branches?) Robert Dewar 2002-12-08 14:18 ` source mgt. requirements solicitation Tom Lord 2002-12-08 14:56 ` DJ Delorie 2002-12-08 15:02 ` David S. Miller 2002-12-08 15:45 ` Bruce Stephens 2002-12-08 16:52 ` David S. Miller 2002-12-08 15:11 ` Bruce Stephens 2002-12-08 16:24 ` Joseph S. Myers 2002-12-08 16:47 ` Tom Lord 2002-12-08 22:20 ` Craig Rodrigues 2002-12-08 16:09 ` Phil Edwards 2002-12-08 19:13 ` Zack Weinberg 2002-12-09 10:33 ` Phil Edwards 2002-12-09 11:06 ` Joseph S. Myers 2002-12-09 9:42 ` Zack Weinberg 2002-12-09 11:00 ` Jack Lloyd 2002-12-09 15:10 ` Walter Landry 2002-12-09 15:27 ` Joseph S. Myers 2002-12-09 17:05 ` Walter Landry 2002-12-09 17:10 ` Joseph S. Myers 2002-12-09 18:27 ` Walter Landry 2002-12-09 19:16 ` Joseph S. Myers 2002-12-10 0:27 ` Zack Weinberg 2002-12-10 0:41 ` Tom Lord 2002-12-10 12:05 ` Phil Edwards 2002-12-10 19:44 ` Mark Mielke 2002-12-10 19:57 ` David S. Miller 2002-12-10 20:02 ` Phil Edwards 2002-12-10 23:07 ` David S. Miller 2002-12-11 6:31 ` Phil Edwards 2002-12-14 13:43 ` Linus Torvalds 2002-12-14 14:06 ` Tom Lord 2002-12-14 17:44 ` Linus Torvalds 2002-12-14 19:45 ` Tom Lord 2002-12-14 14:41 ` Neil Booth 2002-12-14 15:47 ` Zack Weinberg 2002-12-14 15:33 ` Momchil Velikov 2002-12-14 16:06 ` Linus Torvalds 2002-12-15 3:59 ` Momchil Velikov 2002-12-15 8:26 ` Momchil Velikov 2002-12-15 12:02 ` Linus Torvalds 2002-12-15 14:16 ` Momchil Velikov 2002-12-15 15:20 ` Pop Sébastian 2002-12-15 16:09 ` Linus Torvalds 2002-12-15 16:49 ` Bruce Stephens 2002-12-15 16:59 ` Linus Torvalds 2002-12-15 18:10 ` Bruce Stephens 2002-12-16 8:32 ` Diego Novillo 2002-12-17 3:36 ` Pop Sébastian 2002-12-17 13:14 ` Tom Lord 2002-12-17 15:28 ` Itching and scratching (Re: source mgt. 
requirements solicitation) Stan Shebs 2002-12-17 16:07 ` Tom Lord 2002-12-17 15:46 ` Stan Shebs 2002-12-16 17:22 ` source mgt. requirements solicitation Mike Stump 2002-12-15 17:09 ` Stan Shebs 2002-12-09 17:50 ` Zack Weinberg 2002-12-11 1:11 ` Branko Čibej 2002-12-08 18:32 ` Joseph S. Myers 2002-12-11 2:48 ` Branko Čibej 2002-12-18 21:44 Itching and scratching (Re: source mgt. requirements solicitation) Robert Dewar