Re: Subject header matching--once again

From: Mel Hatzis <hatzis@juniper.net>
To: Lars Henriksen <Lars.Henriksen@netman.dk>
Cc: help-gnats@gnu.org, Yngve Svendsen <yngve.svendsen@sun.com>,
	Milan Zamazal <pdm@zamazal.org>
Subject: Re: Subject header matching--once again
Date: Mon, 04 Nov 2002 11:31:00 -0000	[thread overview]
Message-ID: <3DC602D4.1070706@juniper.net> (raw)
In-Reply-To: <20021102213505.GB646077@cluster1.netman.dk>

Nice summary....more comments below.

On 11/02/2002 01:35 PM, Lars Henriksen submitted:

>Hello,
>
>This is a story in four parts about how to recognize a PR in the Subject: header
>of an email sent to GNATS.
>
>PART I
>======
>Back in December 2001 there was a thread discussing Subject: matching, why it
>was so restrictive, how to loosen it and make it more useful. This is the mail
>that started the discussion:
>
>http://mail.gnu.org/pipermail/help-gnats/2001-December/002617.html
>
>Milan concluded:
>
>  
>
>>Well, to summarize, I suggest the following:
>>
>>- Not to try to match "re", "fw", etc. and simply look for a substring
>>  as stated below.
>>
>>- Accept "\<CATEGORY/NUMBER" where CATEGORY is a valid category name and
>>  PR NUMBER is present in CATEGORY.
>>    
>>- Accept "\<PR[ \t/]*NUMBER" where "PR" must be all capitals and NUMBER
>>  corresponds to an existing PR number.
>>    
>>
>
>The second requirement was changed to:
>
>DB>                                                 Accept
>DB> "\<CATEGORY/NUMBER" where CATEGORY is a valid category name and
>DB> NUMBER is a valid PR number (in any category).
>
>It was also debated how to handle a Subject: with several matches (the
>infamous Subject: containing OS/2). Milan ended the thread:
>
>  
>
>>The matching was adjusted according to our consensus, except that I
>>didn't bother with the matching cycle, only the first possible match is
>>considered. I haven't tested it, would you like to do it?
>>    
>>
>
>PART II
>=======
>In February, someone noticed that Subject: matching had stopped working.
>
>http://mail.gnu.org/pipermail/help-gnats/2002-February/002808.html
>
>Andrew Gray submitted a patch:
>
>http://mail.gnu.org/pipermail/help-gnats/2002-March/002831.html
>
>which fixed two bugs and made Subject: matching work again. But the fix was
>not committed and went unnoticed for a time (see PART III).
>
>PART III
>========
>In May, Mel Hatzis noticed the same problem and submitted a fix. After some
>email exchanges Andrew Gray's fix from March was committed so all should be
>well. But it wasn't (and it isn't).
>
>For one thing, the agreement from December is not in the GNATS documentation.
>This may be just as well because the code does not implement it (fully). The
>regular expression now used for matching a PR is:
>
>     \<((PR[ \t/])\|([-a-z0-9_+.]+)/)([0-9]+)
>
>Here regex groups two and three are not used, and the expression may be
>simplified to:
>
>     \<(PR[ \t/]\|[-a-z0-9_+.]+/)([0-9]+
>
>The check for upper case PR is by appearance only because Subject: matching
>always ignores case (notice that the check for category name has no upper
>case letters). This point has never been raised before, but I think it should
>be. I believe that Subject: matching should be case sensitive. Not just to be
>able to check for PR, but simply because it is useful (see PART IV).
>
>Furthermore, PR1234 is not accepted. This form was explicitly mentioned in
>December as desirable. I would like to add PR#1234 and exclude PR/1234.
>
>My proposal is that the regular expression be changed to
>
>     \<(PR[ \t#]?\|[-\w+.]+/)([0-9]+)
>
>and that the regex search be made case sensitive.
>
>This will pick up the first appearance of any of PR 1234, PR#1234, PR1234,
>category/1234 and Category/1234 anywhere in the Subject: header.
>
>PART IV
>=======
>During the discussion in May, Mel Hatzis suggested that the regular expression
>be made configurable via dbconfig:
>
>http://mail.gnu.org/pipermail/help-gnats/2002-May/002901.html
>
>Everyone who uttered an opinion was in favour, and so am I. Mel submitted a
>patch that is still pending:
>
>http://mail.gnu.org/pipermail/help-gnats/2002-May/002930.html
>
>  
>
>>You can now (optionally) include the following in the database-info
>>section of the dbconfig file:
>>
>>    # The regular expression used to determine whether a PR is referenced
>>    # on an email subject line
>>    subject-matching {
>>        "\\<((PR[ \t/])\\|([-a-z0-9_+.]+/))([0-9]+)"
>>        capture-group "4"
>>    }
>>
>>(The above example is exactly analogous to the built-in default)
>>    
>>
>
>The regex group identified by capture-group must capture the PR number.
>
>The proposal is fine, but has a drawback that is worth a discussion. With the
>built-in default (the one discussed in parts I-III) both category, if present,
>and PR number are checked for validity. With a subject-matching entry in
>dbconfig, it is impossible to check the validity of a category: there is no way
>of checking the category since only the PR number is captured. Hence, it
>is not true that the example above, as stated, is exactly analogous to the
>built-in default. At least not by design.
>
Fair enough...your point is well taken.

>
>The last remark in the previous paragraph alludes to a bug in Mel's patch. The
>code extracts the PR number from the matching substring, but unconditionally
>checks for a preceding "PR" or category. The patch also makes gnatsd dump core
>if no subject-matching entry is present in dbconfig, but that is an easy fix.
>
Regarding the unconditional check for a preceding "PR" or category,
this is what is currently implemented - I still assert that the example
I provided is analogous to the current built-in default (except of course,
that the category is not verified as you state above).

>
>My proposal is to extend Mel's design by allowing the category name to be
>captured optionally and checked for validity.
>
>The dbconfig entry syntax could be:
>
>     subject-matching {
>          "regular_expression_with_groups"
>          pr-group "integer"
>          category-group "integer"
>     }
>
>The first integer is the regex group containing the PR number, the second the
>regex group containing the category name or 0 (zero) if not used. The following
>entry is equivalent to the built-in default (with my amendments):
>
>     subject-matching {
>          "\\<(PR[ \t#]?\\|([-\\w+.]+)/)([0-9]+)"
>          pr-group "3"
>          category-group "2"
>     }
>
>An entry that does not use category is:
>
>     subject-matching {
>          "\\<PR[ \t#]?([0-9]+)"
>          pr-group "1"
>          category-group "0"
>     }
>
Building on your proposal, I suggest that it'd be even better if an
array of capture groups could be specified, each associated with a
field name. This would allow for fields other than 'category' on the
subject line.

This could take the following form:

  subject-matching {
      "\\<PR[ \t#/]?([0-9]+)[ \t]?:(.*)"
      captured-fields {
            "Number" "Synopsis"
      }
  }

The example above would match subject lines of the form:

   "PR 333 : missing subject-matching clause causes gnatsd to dump core"

(verifying that the synopsis matched the PR number before accepting it
as a reference to PR 333)

>It should also be decided which syntax bits to use for Subject: matching.
>At present only RE_NO_BK_PARENS is set, but why? Setting e.g. RE_NO_BK_VBAR
>would avoid the need to escape the alternation operator. Milan suggested
>using the same syntax bits as the rest of gnats:
>
>     (RE_SYNTAX_POSIX_EXTENDED | RE_BK_PLUS_QM) & ~RE_DOT_NEWLINE
>
>but these are an issue in their own right, and this email is already becoming
>too long.
>
I agree with Milan that we should be consistent....though like you, I find
these syntax flags questionable. I was going to test my patch against these
syntax flags, and resubmit, but I'll hold off a while and see what comes
of this email thread.

--
Mel Hatzis
Juniper Networks, Inc.

_______________________________________________
Help-gnats mailing list
Help-gnats@gnu.org
http://mail.gnu.org/mailman/listinfo/help-gnats