From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (qmail 48701 invoked by alias); 31 Aug 2016 11:56:26 -0000 Mailing-List: contact libc-alpha-help@sourceware.org; run by ezmlm Precedence: bulk List-Id: List-Subscribe: List-Archive: List-Post: List-Help: , Sender: libc-alpha-owner@sourceware.org Received: (qmail 48690 invoked by uid 89); 31 Aug 2016 11:56:25 -0000 Authentication-Results: sourceware.org; auth=none X-Virus-Found: No X-Spam-SWARE-Status: No, score=-3.4 required=5.0 tests=BAYES_00,RP_MATCHES_RCVD,SPF_HELO_PASS autolearn=ham version=3.3.2 spammy=deftypefun, frequently, 1374,75, @bullet X-HELO: mx1.redhat.com Subject: Re: [PATCH] manual: Clarify the documentation of strverscmp [BZ #20524] To: "Michael Kerrisk (man-pages)" References: <20160829202145.A2548403DC4A9@oldenburg.str.redhat.com> <44b7d954-354c-2407-3079-fd511dd1f510@redhat.com> <929dbcd3-6984-c17f-1193-2e2ac8aeaf14@gmail.com> Cc: libc-alpha From: Florian Weimer Message-ID: <3147398a-fa03-af37-f0d3-230e46182b51@redhat.com> Date: Wed, 31 Aug 2016 11:56:00 -0000 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.2.0 MIME-Version: 1.0 In-Reply-To: <929dbcd3-6984-c17f-1193-2e2ac8aeaf14@gmail.com> Content-Type: multipart/mixed; boundary="------------955E289A7A0356B8C26DFA1A" X-SW-Source: 2016-08/txt/msg00925.txt.bz2 This is a multi-part message in MIME format. --------------955E289A7A0356B8C26DFA1A Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 8bit Content-length: 1141 On 08/30/2016 08:59 PM, Michael Kerrisk (man-pages) wrote: >> the other sequence is truncated to be of the same (extended) length, and >> these two sequences are compared lexicographically. In this last case, > > s/last// ? Maybe “in the last case”? >> the sequence comparison determines the result of the function because >> the extension character (or some character before it) is necessarily >> different from the character at the same offset in the other input >> string. >>>> +@item >>>> +If both digit sequences have an equal, positive number of leading zeros, >>> >>> Why is the word "positive" here? >> >> It's used as a synonym for “non-zero” (to discriminate this case from >> the previous one). > > Ahh -- okay. Somehow that wording threw me. > > Maybe: > > "If both digit sequences start with a zero and have an equal number > of leading zeros..." Thanks. New patch attached. Perhaps we should just say the comparison happens in an implementation-defined manner and is only consistent within a single process (just like strcoll). It would simplify matters and would us allow to fix the algorithm. :) Florian --------------955E289A7A0356B8C26DFA1A Content-Type: text/x-patch; name="strverscmp.patch" Content-Transfer-Encoding: 7bit Content-Disposition: attachment; filename="strverscmp.patch" Content-length: 4987 manual: Clarify the documentation of strverscmp [BZ #20524] 2016-08-31 Florian Weimer [BZ #20524] * manual/string.texi (String/Array Comparison): Clarify the strverscmp behavior. diff --git a/manual/string.texi b/manual/string.texi index bce81a7..1986357 100644 --- a/manual/string.texi +++ b/manual/string.texi @@ -1374,46 +1374,75 @@ The @code{strverscmp} function compares the string @var{s1} against @var{s2}, considering them as holding indices/version numbers. The return value follows the same conventions as found in the @code{strcmp} function. In fact, if @var{s1} and @var{s2} contain no -digits, @code{strverscmp} behaves like @code{strcmp}. +digits, @code{strverscmp} behaves like @code{strcmp} +(in the sense that the sign of the result is the same). -Basically, we compare strings normally (byte by byte), until -we find a digit in each string - then we enter a special comparison -mode, where each sequence of digits is taken as a whole. If we reach the -end of these two parts without noticing a difference, we return to the -standard comparison mode. There are two types of numeric parts: -"integral" and "fractional" (those begin with a '0'). The types -of the numeric parts affect the way we sort them: +The comparison algorithm which the @code{strverscmp} function implements +differs slightly from other version-comparison algorithms. The +implementation is based on a finite-state machine, whose behavior is +approximated below. @itemize @bullet @item -integral/integral: we compare values as you would expect. +The input strings are each split into sequences of non-digits and +digits. These sequences can be empty at the beginning and end of the +string. Digits are determined by the @code{isdigit} function and are +thus subject to the current locale. @item -fractional/integral: the fractional part is less than the integral one. -Again, no surprise. +Comparison starts with a (possibly empty) non-digit sequence. The first +non-equal sequences of non-digits or digits determines the outcome of +the comparison. @item -fractional/fractional: the things become a bit more complex. -If the common prefix contains only leading zeroes, the longest part is less -than the other one; else the comparison behaves normally. +Corresponding non-digit sequences in both strings are compared +lexicographically if their lengths are equal. If the lengths differ, +the shorter non-digit sequence is extended with the input string +character immediately following it (which may be the null terminator), +the other sequence is truncated to be of the same (extended) length, and +these two sequences are compared lexicographically. In the last case, +the sequence comparison determines the result of the function because +the extension character (or some character before it) is necessarily +different from the character at the same offset in the other input +string. + +@item +For two sequences of digits, the number of leading zeros is counted (which +can be zero). If the count differs, the string with more leading zeros +in the digit sequence is considered smaller than the other string. + +@item +If the two sequences of digits have no leading zeros, they are compared +as integers, that is, the string with the longer digit sequence is +deemed larger, and if both sequences are of equal length, they are +compared lexicographically. + +@item +If both digit sequences start with a zero and have an equal number of +leading zeros, they are compared lexicographically if their lengths are +the same. If the lengths differ, the shorter sequence is extended with +the following character in its input string, and the other sequence is +truncated to the same length, and both sequences are compared +lexicographically (similar to the non-digit sequence case above). @end itemize +The treatment of leading zeros and the tie-breaking extension characters +(which in effect propagate across non-digit/digit sequence boundaries) +differs from other version-comparison algorithms. + @smallexample strverscmp ("no digit", "no digit") @result{} 0 /* @r{same behavior as strcmp.} */ strverscmp ("item#99", "item#100") @result{} <0 /* @r{same prefix, but 99 < 100.} */ strverscmp ("alpha1", "alpha001") - @result{} >0 /* @r{fractional part inferior to integral one.} */ + @result{} >0 /* @r{different number of leading zeros (0 and 2).} */ strverscmp ("part1_f012", "part1_f01") - @result{} >0 /* @r{two fractional parts.} */ + @result{} >0 /* @r{lexicographical comparison with leading zeros.} */ strverscmp ("foo.009", "foo.0") - @result{} <0 /* @r{idem, but with leading zeroes only.} */ + @result{} <0 /* @r{different number of leading zeros (2 and 1).} */ @end smallexample -This function is especially useful when dealing with filename sorting, -because filenames frequently hold indices/version numbers. - @code{strverscmp} is a GNU extension. @end deftypefun --------------955E289A7A0356B8C26DFA1A--