From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from us-smtp-1.mimecast.com (us-smtp-delivery-1.mimecast.com [205.139.110.120]) by sourceware.org (Postfix) with ESMTP id C1333386EC42 for ; Mon, 29 Jun 2020 19:42:13 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.3.2 sourceware.org C1333386EC42 Received: from mail-qt1-f199.google.com (mail-qt1-f199.google.com [209.85.160.199]) (Using TLS) by relay.mimecast.com with ESMTP id us-mta-252-X59COns7NmabQXWwvYo4_g-1; Mon, 29 Jun 2020 15:42:11 -0400 X-MC-Unique: X59COns7NmabQXWwvYo4_g-1 Received: by mail-qt1-f199.google.com with SMTP id r25so12880180qtj.11 for ; Mon, 29 Jun 2020 12:42:11 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:subject:to:cc:references:from:organization :message-id:date:user-agent:mime-version:in-reply-to :content-language:content-transfer-encoding; bh=WoHZzBQ8zK3XK3zGSyOnms83qJcRF4qJovVKykfJEXc=; b=RxGJl4vlhrt9O1uKWw+ZFpR7Syfnras1coao6RIWkYByd21iffq/gM98sjwm56RNjw n9kGXz4qhvA7/LRmyMv0SBro3c8XsuhuABLlSK8Bs//sfnxxAuSq19nMFO9i1V2uq1Gp LfOjZoqEQYUL8+sV6huRxJnPy8VuNC43b/O6M5STvg0SUrUr/ocXYTXTwWy8EI+leGGW ijbsFAI7jR7VX3CJAsozdru13+zJjvQ43ShR+I1vFwdrdPTzLRKMRcYE+34Uad+ijQ4s gBKZsJmzjZ+Wbmx2Rc8nYB11DiLULq152M7/yzTQdUnl5nbFSNNdqfHxRvCH7m+P2Jyg QZbw== X-Gm-Message-State: AOAM531kR5JbqOX/rCr32cZ/0G43GlSb5Z02YaNKy+Puc91wZqjz9wx/ bEZzMyCvADxnNZOSdwOTkyYHQ44Lwi18JHfwBxtv+g+ylYClqEV+Hpl3vW/gNPMdIeheK+3IPCi Tt9S3lpbxBk28AyFiynjx X-Received: by 2002:ae9:c209:: with SMTP id j9mr17134901qkg.174.1593459731324; Mon, 29 Jun 2020 12:42:11 -0700 (PDT) X-Google-Smtp-Source: ABdhPJxrYz4fWxRSQhPRlWTVFz+zOAkkU3Eqox3gEpu4bg1/nAA8ffAkWiUE4nZetsOU1Ct17W7o4Q== X-Received: by 2002:ae9:c209:: with SMTP id j9mr17134875qkg.174.1593459730910; Mon, 29 Jun 2020 12:42:10 -0700 (PDT) Received: from [192.168.1.4] (198-84-170-103.cpe.teksavvy.com. [198.84.170.103]) by smtp.gmail.com with ESMTPSA id o5sm898305qtb.26.2020.06.29.12.42.09 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Mon, 29 Jun 2020 12:42:10 -0700 (PDT) Subject: Re: [PATCH 1/2] LC_COLLATE: Fix handling of last character in ellipsis, (Bug 22668) To: Florian Weimer , Carlos O'Donell via Libc-alpha Cc: Hanataka Shinya References: <75d21bd8-2698-2e25-969c-4e086c90abd9@redhat.com> <87ftae2st7.fsf@oldenburg2.str.redhat.com> From: Carlos O'Donell Organization: Red Hat Message-ID: <65d95078-155e-f1cb-5883-2db7ed00d289@redhat.com> Date: Mon, 29 Jun 2020 15:42:09 -0400 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.7.0 MIME-Version: 1.0 In-Reply-To: <87ftae2st7.fsf@oldenburg2.str.redhat.com> Content-Language: en-US X-Mimecast-Spam-Score: 0 X-Mimecast-Originator: redhat.com Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 8bit X-Spam-Status: No, score=-12.4 required=5.0 tests=BAYES_00, DKIMWL_WL_HIGH, DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, GIT_PATCH_0, RCVD_IN_DNSWL_NONE, RCVD_IN_MSPIKE_H3, RCVD_IN_MSPIKE_WL, SPF_HELO_NONE, SPF_PASS, TXREP autolearn=ham autolearn_force=no version=3.4.2 X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on server2.sourceware.org X-BeenThere: libc-alpha@sourceware.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Libc-alpha mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 29 Jun 2020 19:42:15 -0000 On 6/29/20 4:13 AM, Florian Weimer wrote: > * Carlos O'Donell via Libc-alpha: > >> During ellipsis processing the collation cursor was not correctly >> moved to the end of the ellipsis after processing. This meant that >> the cursor was left, usually, at the second to last entry. >> Subsequent operations end up unlinking the ellipsis end entry or >> just leaving it in the list dangling from the end. This kind of >> dangling is immediately visible in C.UTF-8 with the following >> sorting from strcoll: >> >> >> >> >> With the cursor correctly adjusted the end entry is correctly given >> the right location and thus the right weight. >> >> No regressions on x86_64 and i686. >> >> Co-authored-by: Carlos O'Donell >> --- >> locale/programs/ld-collate.c | 3 +++ >> 1 file changed, 3 insertions(+) >> >> diff --git a/locale/programs/ld-collate.c b/locale/programs/ld-collate.c >> index feb1a11258..a8ba2f07f0 100644 >> --- a/locale/programs/ld-collate.c >> +++ b/locale/programs/ld-collate.c >> @@ -1483,6 +1483,9 @@ order for `%.*s' already defined at %s:%Zu"), >> } >> } >> } >> + /* Move the cursor to the last entry in the ellipsis. >> + Subsequent operations need to start from the last entry. */ >> + collate->cursor = endp; >> } > > Can't endp be NULL at this point? Yes, but it's an error condition. If it's NULL and symstr != NULL, then we are in an error condition and will trigger the "symbolic range ellipsis must not be directly followed by `order_end'", and we will set the cursor to NULL. It is common to set the cursor to NULL after an error to avoid any subsequent unlink_element() from working due to the error. In practice it looks like we'd crash if you called unlink_element() with the cursor set to NULL e.g. NULL deref in unlink_element(). You shouldn't be doing that anyway, you should be shutting down. I could clarify the error condition in a comment? > Besides that, why does the change make a difference at all? There > already is an assignment to collate->cursor in both “Enqueue the new > element” cases. When we get called we have the following: [cursor] ... element_t <-> element_t <-> element_t "" "..." The cursor points to a double-linked list of collation elements. We enter with the pseudo-entry pointing at the ellipsis. Next we unlink the ellipsis, and the cursor is back at . [cursor] ... element_t <-> element_t "" The value of symstr is the ending symbol of the ellipsis so it's . We add that onto the list: [cursor] ... element_t <-> element_t <-> element_t "" "" However, then we set the cursor on purpose because we want to insert symbols between the start and end elements: 1092 /* Reset the cursor. */ 1093 collate->cursor = startp; [cursor] ... element_t <-> element_t <-> element_t "" "" startp endp Then we proceed to add elements *after* the cursor. We iterate from U0001 to U007E, adding entries. 1451 /* Enqueue the new element. */ 1452 elem->last = collate->cursor; 1453 elem->next = collate->cursor->next; 1454 elem->last->next = elem; 1455 if (elem->next != NULL) 1456 elem->next->last = elem; 1457 collate->cursor = elem; This code inserts the new entry after the cursor, but before the real end of the ellipsis: [cursor] ... element_t <-> element_t <-> element_t <-> element_t "" "" "" startp endp At the end of the function we have: [cursor] ... element_t <-> element_t <-> element_t "" "" endp The cursor should be pointing at endp, the last element in the double linked list, otherwise when we come back to the caller we will start inserting the next line after . The stack at this point looks like this: #0 handle_ellipsis (ldfile=, ldfile@entry=0x5555555a2e80, symstr=, symstr@entry=0x7fffffffd080 "U0000007F", symlen=, symlen@entry=9, ellipsis=, ellipsis@entry=tok_ellipsis2, charmap=, charmap@entry=0x5555555a2fd0, repertoire=, repertoire@entry=0x0, result=) at programs/ld-collate.c:1488 #1 0x000055555557d424 in collate_read (ldfile=, result=0x7fffffffd170, charmap=0x5555555a2fd0, repertoire_name=0x0, ignore_content=0) at programs/ld-collate.c:3670 #2 0x000055555558436d in locfile_read (result=0x7fffffffd170, charmap=0x5555555a2fd0) at programs/locfile.c:180 #3 0x000055555555da3f in main (argc=, argv=0x7fffffffd3f8) at programs/localedef.c:263 We return to collate_read, break, and read more data: 3963 /* Prepare for the next round. */ 3964 now = lr_token (ldfile, charmap, result, NULL, verbose); 3965 nowtok = now->tok; We read the *next* ellipsis start symbol, remember that. We read the *next* ellipsis "..." (tok_ellipsis2), remember that. We read the *next* ellipsis end symbol, and this triggers handle_ellipsis. All this time insert_value()->insert_weights() is using collate->cursor to insert values immediately after , and the (end of ellipsis) is still at the end of the doubly-linked list. When I finish parsing C.UTF-8 I have a list that ends like this: <-> ... <-> <-> <-> <-> With each tok_ellipsis2 failing to reset the cursor, and so leaving the trailing end of the last ellipsis in the doubly-linked list to get assigned a weight in that order. Does that clarify the fix? Shall I add more comments about the cursor handling? -- Cheers, Carlos.