From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path:
Received: from tarta.nabijaczleweli.xyz (unknown [139.28.40.42])
	by sourceware.org (Postfix) with ESMTP id 0835A3858D37
	for ; Fri, 21 Apr 2023 01:41:25 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 0835A3858D37
Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=nabijaczleweli.xyz
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=nabijaczleweli.xyz
Received: from tarta.nabijaczleweli.xyz (unknown [192.168.1.250])
	by tarta.nabijaczleweli.xyz (Postfix) with ESMTPSA id 0D0A060FC;
	Fri, 21 Apr 2023 03:41:24 +0200 (CEST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=nabijaczleweli.xyz;
	s=202211; t=1682041284;
	bh=p0uwTJoO0/yY5A8nhNfpTmps0pfrwmM6dwDEApwUKo4=;
	h=Date:From:To:Cc:Subject:References:In-Reply-To:From;
	b=DM83J7FkUx6SLnRkrclL0vI3rICV7bRrNKH+5OzRFRWnpKg8nLRQuXLKUTWC1h79c
	 8tfpM5i67nfcFY4s2qdKzM16/MvJ/fBVStcI7hqfIs5OpwogD0tgdu0UdyGJKAkJja
	 G5linJVFsqMI+AYBmzq+sMby7zaRjXF5mAAGRvTdAsoaSs7F7pDVIIrsM1Mg840Mvy
	 Sz5nr//cqp22yiVglxjilt5yT4a71OAr948/xkkFMwNfVty13EDrI1dcozABEf2JuB
	 w25ESoaEW9AE1/hz+Q2h8DIwjheUNJfLQB/+Beo+4Pp4kQxyvEjI3xNYbbI9AqoaWH
	 JsAr896zlltQA==
Date: Fri, 21 Apr 2023 03:41:22 +0200
From: =?utf-8?B?0L3QsNCx?=
To: Alejandro Colomar
Cc: GNU C Library, Siddhesh Poyarekar
Subject: Re: regexec(3): REG_STARTEND is not documented
Message-ID: <7j64b34c7gjtltgqp4iyha335pkfwelj7miemozm6xcd3oy7ic@tdchlurmv2kk>
References: <0de87674-1b35-8dc8-7d2b-8dacd6b015ff@gmail.com>
	<2f3a3aa5-9e01-8f46-7b98-de03cf304aad@gmail.com>
MIME-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha512;
	protocol="application/pgp-signature"; boundary="zs6n4ofh6orski6j"
Content-Disposition: inline
In-Reply-To: <2f3a3aa5-9e01-8f46-7b98-de03cf304aad@gmail.com>
User-Agent: NeoMutt/20230407
X-Spam-Status: No, score=-2.1 required=5.0
	tests=BAYES_00,DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,KAM_INFOUSMEBIZ,LIKELY_SPAM_BODY,RDNS_DYNAMIC,SPF_HELO_PASS,SPF_PASS,TXREP,T_SCC_BODY_TEXT_LINE
	autolearn=no autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org
List-Id:

--zs6n4ofh6orski6j
Content-Type: multipart/mixed; boundary="b26mzkmyyl6azjzj"
Content-Disposition: inline

--b26mzkmyyl6azjzj
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline

Hi!

On Fri, Apr 21, 2023 at 03:28:42AM +0200, Alejandro Colomar wrote:
> On 4/21/23 03:15, наб wrote:
> > On Fri, Apr 21, 2023 at 03:07:00AM +0200, Alejandro Colomar wrote:
> >> Here's a related question:
> >>     regmatch_t pmatch = {
> >>         .rm_so = string,
> >>         .rm_eo = string + 42,  // Assume this offset is valid
> >>     };
> >>     regexec(preg, string, 0, pmatch, REG_STARTEND);
> >> Should regexec(3) write to the 1st element in pmatch[] because it knows
> >> it exists (otherwise the call would be UB because it needs to read it)?
> > (Which would run counter to how POSIX defines the API.)
> >
> >> Or is passing 0 in nmatch effectively another way of performing
> >> REG_NOSUB behavior without actually using the flag?
> > Hilariously enough, quoth 4.4BSD-Lite regex(3) again,
> > which phrases it exactly like you do:
> >      If REG_NOSUB was specified in the compilation of the RE, or if nmatch
> >      is 0, regexec ignores the pmatch argument (but see below for the case
> >      where REG_STARTEND is specified).
> Touche; it looks like your right.  That sentence is unambiguous.  BTW, is the
> reference to some other text about REG_STARTEND the one quoted first
> (above)?
That's the first mention of [np]match afterward, yeah.

The file's massive, so I've attached it here,
you can read more brain-destroying minutiae there;
it's modern-enough mdoc(7) that anything should render it sensibly.
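
For illustration, the nmatch-is-0 case from the question above could be
sketched in C roughly like this. It is only a sketch under assumptions:
it presumes a libc that provides the REG_STARTEND extension (glibc and
the BSDs define it), and match_in_range() is a hypothetical helper name,
not part of any library. Note that rm_so and rm_eo hold offsets from
string (type regoff_t), not pointers.

```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper (not from the thread or any library): report
 * whether `pattern` matches somewhere inside string[start, end),
 * relying on the non-POSIX REG_STARTEND extension.
 * Returns 1 on a match, 0 on no match, -1 on any error.
 */
int match_in_range(const char *pattern, const char *string,
                   size_t start, size_t end)
{
	regex_t preg;
	regmatch_t pmatch[1];
	int rc;

	if (regcomp(&preg, pattern, REG_EXTENDED) != 0)
		return -1;

	/* Input only: offsets measured from `string`, not pointers. */
	pmatch[0].rm_so = (regoff_t)start;
	pmatch[0].rm_eo = (regoff_t)end;

	/*
	 * nmatch is 0: pmatch[0] is still read to delimit the string,
	 * but (per the 4.4BSD-Lite text quoted above) it is not meant
	 * to be used for output.
	 */
	rc = regexec(&preg, string, 0, pmatch, REG_STARTEND);
	regfree(&preg);

	if (rc == 0)
		return 1;
	return rc == REG_NOMATCH ? 0 : -1;
}
```

With such a helper, match_in_range("bar", "foo bar baz", 4, 7) would be
expected to report a match, while the same pattern over offsets 0 to 3
would not, since only "foo" is considered there.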

I've also noticed (even later, did I mention they couldn't stop writing?):
     If REG_STARTEND is specified, pmatch must point to at least one
     regmatch_t (even if nmatch is 0 or REG_NOSUB was specified), to hold the
     input offsets for REG_STARTEND.  Use for output is still entirely
     controlled by nmatch; if nmatch is 0 or REG_NOSUB was specified, the
     value of pmatch[0] will not be changed by a successful regexec.

Best,
наб

--b26mzkmyyl6azjzj
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename="regex.3"

.\" Copyright (c) 1992 Henry Spencer.
.\" Copyright (c) 1992, 1993
.\"	The Regents of the University of California.  All rights reserved.
.\"
.\" This code is derived from software contributed to Berkeley by
.\" Henry Spencer of the University of Toronto.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\" 3. All advertising materials mentioning features or use of this software
.\"    must display the following acknowledgement:
.\"	This product includes software developed by the University of
.\"	California, Berkeley and its contributors.
.\" 4. Neither the name of the University nor the names of its contributors
.\"    may be used to endorse or promote products derived from this software
.\"    without specific prior written permission.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\"	@(#)regex.3	8.1 (Berkeley) 6/17/93
.\"
.de ZR
.\" one other place knows this name:  the SEE ALSO section
.IR re_format (7) \\$1
..
.TH REGEX 3 "June 17, 1993"
.SH NAME
regcomp, regexec, regerror, regfree \- regular-expression library
.SH SYNOPSIS
.ft B
.\".na
#include <sys/types.h>
.br
#include <regex.h>
.HP 10
int regcomp(regex_t\ *preg, const\ char\ *pattern, int\ cflags);
.HP
int\ regexec(const\ regex_t\ *preg, const\ char\ *string,
size_t\ nmatch, regmatch_t\ pmatch[], int\ eflags);
.HP
size_t\ regerror(int\ errcode, const\ regex_t\ *preg,
char\ *errbuf, size_t\ errbuf_size);
.HP
void\ regfree(regex_t\ *preg);
.\".ad
.ft
.SH DESCRIPTION
These routines implement POSIX 1003.2 regular expressions (``RE''s);
see
.ZR .
.I Regcomp
compiles an RE written as a string into an internal form,
.I regexec
matches that internal form against a string and reports results,
.I regerror
transforms error codes from either into human-readable messages,
and
.I regfree
frees any dynamically-allocated storage used by the internal form
of an RE.
.PP
The header
.I <regex.h>
declares two structure types,
.I regex_t
and
.IR regmatch_t ,
the former for compiled internal forms and the latter for match reporting.
It also declares the four functions,
a type
.IR regoff_t ,
and a number of constants with names starting with ``REG_''.
.PP
.I Regcomp
compiles the regular expression contained in the
.I pattern
string,
subject to the flags in
.IR cflags ,
and places the results in the
.I regex_t
structure pointed to by
.IR preg .
.I Cflags
is the bitwise OR of zero or more of the following flags:
.IP REG_EXTENDED \w'REG_EXTENDED'u+2n
Compile modern (``extended'') REs,
rather than the obsolete (``basic'') REs that
are the default.
.IP REG_BASIC
This is a synonym for 0,
provided as a counterpart to REG_EXTENDED to improve readability.
.IP REG_NOSPEC
Compile with recognition of all special characters turned off.
All characters are thus considered ordinary,
so the ``RE'' is a literal string.
This is an extension,
compatible with but not specified by POSIX 1003.2,
and should be used with
caution in software intended to be portable to other systems.
REG_EXTENDED and REG_NOSPEC may not be used
in the same call to
.IR regcomp .
.IP REG_ICASE
Compile for matching that ignores upper/lower case distinctions.
See
.ZR .
.IP REG_NOSUB
Compile for matching that need only report success or failure,
not what was matched.
.IP REG_NEWLINE
Compile for newline-sensitive matching.
By default, newline is a completely ordinary character with no special
meaning in either REs or strings.
With this flag,
`[^' bracket expressions and `.' never match newline,
a `^' anchor matches the null string after any newline in the string
in addition to its normal function,
and the `$' anchor matches the null string before any newline in the
string in addition to its normal function.
.IP REG_PEND
The regular expression ends,
not at the first NUL,
but just before the character pointed to by the
.I re_endp
member of the structure pointed to by
.IR preg .
The
.I re_endp
member is of type
.IR const\ char\ * .
This flag permits inclusion of NULs in the RE;
they are considered ordinary characters.
This is an extension,
compatible with but not specified by POSIX 1003.2,
and should be used with
caution in software intended to be portable to other systems.
.PP
When successful,
.I regcomp
returns 0 and fills in the structure pointed to by
.IR preg .
One member of that structure
(other than
.IR re_endp )
is publicized:
.IR re_nsub ,
of type
.IR size_t ,
contains the number of parenthesized subexpressions within the RE
(except that the value of this member is undefined if the
REG_NOSUB flag was used).
If
.I regcomp
fails, it returns a non-zero error code;
see DIAGNOSTICS.
.PP
.I Regexec
matches the compiled RE pointed to by
.I preg
against the
.IR string ,
subject to the flags in
.IR eflags ,
and reports results using
.IR nmatch ,
.IR pmatch ,
and the returned value.
The RE must have been compiled by a previous invocation of
.IR regcomp .
The compiled form is not altered during execution of
.IR regexec ,
so a single compiled RE can be used simultaneously by multiple threads.
.PP
By default,
the NUL-terminated string pointed to by
.I string
is considered to be the text of an entire line, minus any terminating
newline.
The
.I eflags
argument is the bitwise OR of zero or more of the following flags:
.IP REG_NOTBOL \w'REG_STARTEND'u+2n
The first character of
the string
is not the beginning of a line, so the `^' anchor should not match before it.
This does not affect the behavior of newlines under REG_NEWLINE.
.IP REG_NOTEOL
The NUL terminating
the string
does not end a line, so the `$' anchor should not match before it.
This does not affect the behavior of newlines under REG_NEWLINE.
.IP REG_STARTEND
The string is considered to start at
\fIstring\fR\ + \fIpmatch\fR[0].\fIrm_so\fR
and to have a terminating NUL located at
\fIstring\fR\ + \fIpmatch\fR[0].\fIrm_eo\fR
(there need not actually be a NUL at that location),
regardless of the value of
.IR nmatch .
See below for the definition of
.IR pmatch
and
.IR nmatch .
This is an extension,
compatible with but not specified by POSIX 1003.2,
and should be used with
caution in software intended to be portable to other systems.
Note that a non-zero \fIrm_so\fR does not imply REG_NOTBOL;
REG_STARTEND affects only the location of the string,
not how it is matched.
.PP
See
.ZR
for a discussion of what is matched in situations where an RE or a
portion thereof could match any of several substrings of
.IR string .
.PP
Normally,
.I regexec
returns 0 for success and the non-zero code REG_NOMATCH for failure.
Other non-zero error codes may be returned in exceptional situations;
see DIAGNOSTICS.
.PP
If REG_NOSUB was specified in the compilation of the RE,
or if
.I nmatch
is 0,
.I regexec
ignores the
.I pmatch
argument (but see below for the case where REG_STARTEND is specified).
Otherwise,
.I pmatch
points to an array of
.I nmatch
structures of type
.IR regmatch_t .
Such a structure has at least the members
.I rm_so
and
.IR rm_eo ,
both of type
.I regoff_t
(a signed arithmetic type at least as large as an
.I off_t
and a
.IR ssize_t ),
containing respectively the offset of the first character of a substring
and the offset of the first character after the end of the substring.
Offsets are measured from the beginning of the
.I string
argument given to
.IR regexec .
An empty substring is denoted by equal offsets,
both indicating the character following the empty substring.
.PP
The 0th member of the
.I pmatch
array is filled in to indicate what substring of
.I string
was matched by the entire RE.
Remaining members report what substring was matched by parenthesized
subexpressions within the RE;
member
.I i
reports subexpression
.IR i ,
with subexpressions counted (starting at 1) by the order of their opening
parentheses in the RE, left to right.
Unused entries in the array\(emcorresponding either to subexpressions that
did not participate in the match at all, or to subexpressions that do not
exist in the RE (that is, \fIi\fR\ > \fIpreg\fR\->\fIre_nsub\fR)\(emhave both
.I rm_so
and
.I rm_eo
set to \-1.
If a subexpression participated in the match several times,
the reported substring is the last one it matched.
(Note, as an example in particular, that when the RE `(b*)+' matches `bbb',
the parenthesized subexpression matches each of the three `b's and then
an infinite number of empty strings following the last `b',
so the reported substring is one of the empties.)
.PP
If REG_STARTEND is specified,
.I pmatch
must point to at least one
.I regmatch_t
(even if
.I nmatch
is 0 or REG_NOSUB was specified),
to hold the input offsets for REG_STARTEND.
Use for output is still entirely controlled by
.IR nmatch ;
if
.I nmatch
is 0 or REG_NOSUB was specified,
the value of
.IR pmatch [0]
will not be changed by a successful
.IR regexec .
.PP
.I Regerror
maps a non-zero
.I errcode
from either
.I regcomp
or
.I regexec
to a human-readable, printable message.
If
.I preg
is non-NULL,
the error code should have arisen from use of
the
.I regex_t
pointed to by
.IR preg ,
and if the error code came from
.IR regcomp ,
it should have been the result from the most recent
.I regcomp
using that
.IR regex_t .
.RI ( Regerror
may be able to supply a more detailed message using information
from the
.IR regex_t .)
.I Regerror
places the NUL-terminated message into the buffer pointed to by
.IR errbuf ,
limiting the length (including the NUL) to at most
.I errbuf_size
bytes.
If the whole message won't fit,
as much of it as will fit before the terminating NUL is supplied.
In any case,
the returned value is the size of buffer needed to hold the whole
message (including terminating NUL).
If
.I errbuf_size
is 0,
.I errbuf
is ignored but the return value is still correct.
.PP
If the
.I errcode
given to
.I regerror
is first ORed with REG_ITOA,
the ``message'' that results is the printable name of the error code,
e.g. ``REG_NOMATCH'',
rather than an explanation thereof.
If
.I errcode
is REG_ATOI,
then
.I preg
shall be non-NULL and the
.I re_endp
member of the structure it points to
must point to the printable name of an error code;
in this case, the result in
.I errbuf
is the decimal digits of
the numeric value of the error code
(0 if the name is not recognized).
REG_ITOA and REG_ATOI are intended primarily as debugging facilities;
they are extensions,
compatible with but not specified by POSIX 1003.2,
and should be used with
caution in software intended to be portable to other systems.
Be warned also that they are considered experimental and changes are possible.
.PP
.I Regfree
frees any dynamically-allocated storage associated with the compiled RE
pointed to by
.IR preg .
The remaining
.I regex_t
is no longer a valid compiled RE
and the effect of supplying it to
.I regexec
or
.I regerror
is undefined.
.PP
None of these functions references global variables except for tables
of constants;
all are safe for use from multiple threads if the arguments are safe.
.SH IMPLEMENTATION CHOICES
There are a number of decisions that 1003.2 leaves up to the implementor,
either by explicitly saying ``undefined'' or by virtue of them being
forbidden by the RE grammar.
This implementation treats them as follows.
.PP
See
.ZR
for a discussion of the definition of case-independent matching.
.PP
There is no particular limit on the length of REs,
except insofar as memory is limited.
Memory usage is approximately linear in RE size, and largely insensitive
to RE complexity, except for bounded repetitions.
See BUGS for one short RE using them
that will run almost any system out of memory.
.PP
A backslashed character other than one specifically given a magic meaning
by 1003.2 (such magic meanings occur only in obsolete [``basic''] REs)
is taken as an ordinary character.
.PP
Any unmatched [ is a REG_EBRACK error.
.PP
Equivalence classes cannot begin or end bracket-expression ranges.
The endpoint of one range cannot begin another.
.PP
RE_DUP_MAX, the limit on repetition counts in bounded repetitions, is 255.
.PP
A repetition operator (?, *, +, or bounds) cannot follow another
repetition operator.
A repetition operator cannot begin an expression or subexpression
or follow `^' or `|'.
.PP
`|' cannot appear first or last in a (sub)expression or after another `|',
i.e. an operand of `|' cannot be an empty subexpression.
An empty parenthesized subexpression, `()', is legal and matches an
empty (sub)string.
An empty string is not a legal RE.
.PP
A `{' followed by a digit is considered the beginning of bounds for a
bounded repetition, which must then follow the syntax for bounds.
A `{' \fInot\fR followed by a digit is considered an ordinary character.
.PP
`^' and `$' beginning and ending subexpressions in obsolete (``basic'')
REs are anchors, not ordinary characters.
.SH SEE ALSO
grep(1), re_format(7)
.PP
POSIX 1003.2, sections 2.8 (Regular Expression Notation)
and
B.5 (C Binding for Regular Expression Matching).
.SH DIAGNOSTICS
Non-zero error codes from
.I regcomp
and
.I regexec
include the following:
.PP
.nf
.ta \w'REG_ECOLLATE'u+3n
REG_NOMATCH	regexec() failed to match
REG_BADPAT	invalid regular expression
REG_ECOLLATE	invalid collating element
REG_ECTYPE	invalid character class
REG_EESCAPE	\e applied to unescapable character
REG_ESUBREG	invalid backreference number
REG_EBRACK	brackets [ ] not balanced
REG_EPAREN	parentheses ( ) not balanced
REG_EBRACE	braces { } not balanced
REG_BADBR	invalid repetition count(s) in { }
REG_ERANGE	invalid character range in [ ]
REG_ESPACE	ran out of memory
REG_BADRPT	?, *, or + operand invalid
REG_EMPTY	empty (sub)expression
REG_ASSERT	``can't happen''\(emyou found a bug
REG_INVARG	invalid argument, e.g. negative-length string
.fi
.SH HISTORY
Written by Henry Spencer at University of Toronto,
henry@zoo.toronto.edu.
.SH BUGS
This is an alpha release with known defects.
Please report problems.
.PP
There is one known functionality bug.
The implementation of internationalization is incomplete:
the locale is always assumed to be the default one of 1003.2,
and only the collating elements etc. of that locale are available.
.PP
The back-reference code is subtle and doubts linger about its correctness
in complex cases.
.PP
.I Regexec
performance is poor.
This will improve with later releases.
.I Nmatch
exceeding 0 is expensive;
.I nmatch
exceeding 1 is worse.
.I Regexec
is largely insensitive to RE complexity \fIexcept\fR that back
references are massively expensive.
RE length does matter; in particular, there is a strong speed bonus
for keeping RE length under about 30 characters,
with most special characters counting roughly double.
.PP
.I Regcomp
implements bounded repetitions by macro expansion,
which is costly in time and space if counts are large
or bounded repetitions are nested.
An RE like, say,
`((((a{1,100}){1,100}){1,100}){1,100}){1,100}'
will (eventually) run almost any existing machine out of swap space.
.PP
There are suspected problems with response to obscure error conditions.
Notably,
certain kinds of internal overflow,
produced only by truly enormous REs or by multiply nested bounded repetitions,
are probably not handled well.
.PP
Due to a mistake in 1003.2, things like `a)b' are legal REs because `)' is
a special character only in the presence of a previous unmatched `('.
This can't be fixed until the spec is fixed.
.PP
The standard's definition of back references is vague.
For example, does
`a\e(\e(b\e)*\e2\e)*d' match `abbbd'?
Until the standard is clarified,
behavior in such cases should not be relied on.
.PP
The implementation of word-boundary matching is a bit of a kludge,
and bugs may lurk in combinations of word-boundary matching and anchoring.

--b26mzkmyyl6azjzj--

--zs6n4ofh6orski6j
Content-Type: application/pgp-signature; name="signature.asc"

-----BEGIN PGP SIGNATURE-----

iQIzBAABCgAdFiEEfWlHToQCjFzAxEFjvP0LAY0mWPEFAmRB6cAACgkQvP0LAY0m
WPGbdhAAiLoi+7HaZbz1EJHrtc/+loc3eAo+TENaR5Okq9ARHFx4vPjAZJWLhT0i
V4S0bxE8LT3rPCmf+fBHrxprZQNYGAU8IrRaa+6DNcFYiiJu83ebJvRXiqRKAd6A
dP8vNa3qkXz4fE1E8ElIIuY0Y8RoGCslEHdJgoi1bBTRntuk0bUNFECN5+8MAO0m
D3bBi2BpLStv0UUnwJMvNX2skm50dxT5BDeCoGZeMSwNSYIm5+v70aQmsfwekCB+
gMLC3f56aCdVHKmOiNt1tPU/JZYGTK0367FyAChpUJo27JgV1SNSNPr/Q3FbDUjm
/Se18fvyYHbk6QIqaazwYFN/KGLp4vZA8uafVomaTjX0LV3CKtPEMLmyxBKoi0AD
YBGPmaWbVy0MYtorCXV/PgbvFU7Pnc87AR2JFyDFPq50am+ILrM9+ZbLpMNpcG4r
d+VGmmy3SGOffpZMwRaRvN8MLsmrc7XrgIDXA6hD8zAUqbT8JhXJClNqG0IxViNl
HMc0ipS/FwyofvN5ip+d102hTiIgzbKjRF6nSBpZ22TZZFt4YCQsjSnIxVF8qS6P
6og1a4ItDmyda2XXI8XdGeJmTXTn1pw8pbYJGVPgpz6kVIRG3+NPzOeX/inGNZ4O
VjFxXt14Ibm3GJ+v60Vr0G5LXZuc/QitGUkzPCSWREoHDt2sHTI=
=oEA9
-----END PGP SIGNATURE-----

--zs6n4ofh6orski6j--