From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=zlAA=3W=in.tum.de=thomas.neumann@sourceware.org>
Received: from mailout2.rbg.tum.de (mailout2.rbg.tum.de [IPv6:2a09:80c0::202])
	by sourceware.org (Postfix) with ESMTPS id C6A75385843E
	for <gcc-patches@gcc.gnu.org>; Tue, 22 Nov 2022 00:31:09 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org C6A75385843E
Authentication-Results: sourceware.org; dmarc=none (p=none dis=none) header.from=in.tum.de
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=in.tum.de
Received: from mailrelay1.rbg.tum.de (mailrelay1.in.tum.de [IPv6:2a09:80c0:254::14])
	by mailout2.rbg.tum.de (Postfix) with ESMTPS id F11524C01F4;
	Tue, 22 Nov 2022 01:31:04 +0100 (CET)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=in.tum.de;
	s=20220209; t=1669077065;
	bh=BFqq/PJhr/z3QqTrsaoHw8d+I9v5KCHqbHtv30TyelQ=;
	h=Date:Subject:To:Cc:References:From:In-Reply-To:From;
	b=aLG1Ou5bZfxxZfqI7dPXjr7lGAr20gWq6bnTdrziV4XouY7WbGJK0eCGDNtJsbpG5
	 kfSCpWL6If7tv7rgzM8/ZWqIuMv8sX+/+5C/3z5aqwCug4XrDQcpuXdVkbhWgWDUCH
	 r0oPSRdj8H7CM7rRV7kZBJxMLTIH5sV2GQ1VE1lloyY1Kv7sVRcmCEtF07jQ3mwIRo
	 HtoPFlEGFB6Dol1vzWO6ztYC0CEDmCBx6v+sNJ/Engd4vsgERqKfiQ9NgOp1rLncuu
	 SU7YS9rdDzAkgQOVGHSHVKbv718AECI5dZOEH54KEaEpM+l5Bj4Vm88wK8QR2jJl8L
	 D1MWcE4fdLqOQ==
Received: by mailrelay1.rbg.tum.de (Postfix, from userid 112)
	id E868ADD; Tue, 22 Nov 2022 01:31:04 +0100 (CET)
Received: from mailrelay1.rbg.tum.de (localhost [127.0.0.1])
	by mailrelay1.rbg.tum.de (Postfix) with ESMTP id 95A0E85;
	Tue, 22 Nov 2022 01:31:04 +0100 (CET)
Received: from mail.in.tum.de (vmrbg426.in.tum.de [131.159.0.73])
	by mailrelay1.rbg.tum.de (Postfix) with ESMTPS id 9017B22;
	Tue, 22 Nov 2022 01:31:04 +0100 (CET)
Received: by mail.in.tum.de (Postfix, from userid 112)
	id 87D1F4A021A; Tue, 22 Nov 2022 01:31:04 +0100 (CET)
Received: (Authenticated sender: neumann)
	by mail.in.tum.de (Postfix) with ESMTPSA id 7ECBF4A0113;
	Tue, 22 Nov 2022 01:31:03 +0100 (CET)
	(Extended-Queue-bit xtech_em@fff.in.tum.de)
Message-ID: <91045a34-a534-4436-bb06-cac32d797a36@in.tum.de>
Date: Tue, 22 Nov 2022 01:31:02 +0100
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101
 Thunderbird/102.4.2
Subject: Re: [PATCH v4] eliminate mutex in fast path of __register_frame
To: "H.J. Lu" <hjl.tools@gmail.com>, Jakub Jelinek <jakub@redhat.com>
Cc: Tamar Christina <Tamar.Christina@arm.com>,
 "gcc-patches@gcc.gnu.org" <gcc-patches@gcc.gnu.org>,
 Jason Merrill <jason@redhat.com>, Florian Weimer <fweimer@redhat.com>,
 Jonathan Wakely <jwakely.gcc@gmail.com>
References: <2a4776b9-9271-bb3c-a626-d5ec22dae6f3@in.tum.de>
 <VI1PR08MB532513A4AF91751CAA34A2D7FF0A9@VI1PR08MB5325.eurprd08.prod.outlook.com>
 <b4b30340-0854-e2d2-03ce-a76c5052e335@in.tum.de> <Y3tljCN9GmZXiApb@tucnak>
 <CAMe9rOoAwZkJjPP23DPsP0P3JcZi_SxwFcNOuB1uxxj2rmaqtg@mail.gmail.com>
Content-Language: en-US
From: Thomas Neumann <thomas.neumann@in.tum.de>
In-Reply-To: <CAMe9rOoAwZkJjPP23DPsP0P3JcZi_SxwFcNOuB1uxxj2rmaqtg@mail.gmail.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
X-Spam-Status: No, score=-4.6 required=5.0 tests=BAYES_00,DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,NICE_REPLY_A,SPF_HELO_NONE,SPF_PASS,TXREP autolearn=ham autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org
List-Id: <gcc-patches.gcc.gnu.org>

Hi,

>>>> When dynamically linking a fast enough machine hides the latency, but when
>>>> Statically linking or on slower devices this change caused a 5x increase in
>>>> Instruction count and 2x increase in cycle count before getting to main.

I have looked at ways to fix that. The problem is that with static 
linking unwinding tables are registered dynamically, and with my patch 
that registration triggers an eager sort of fde lists. While previously 
the lists were sorted when the first exception was thrown. If an 
application throws at least one exception there is no downside in eager 
sorting, but if the application never throws there is overhead.

The obvious way to improve the situation is to make sorting faster. When 
replacing the split+sort+merge logic with a radix sort (which can be 
done without additional memory) we get the following timings for your
#include <cstdio>
int main() {}
example (with git stat -r 200):

# pre-patch version, fast
               0,06 msec task-clock
            272.286      cycles
            464.754      instructions
# post-patch version, slow
               0,21 msec task-clock
            972.876      cycles
          3.079.515      instructions
# +radix sort, in the middle
               0,13 msec task-clock
            604.697      cycles
          1.702.930      instructions

The radix sort clearly improves things, but it does not fully eliminate 
the overhead.

The question is, how much do we care about that situation (i.e., static 
linking, exceptions registered but never thrown). I could change the 
code to recognize three states instead of two: no exceptions registered, 
exceptions register but never thrown, and full exception mode. But that 
would increase code complexity and it would pessimize applications that 
do throw, as we now need more checks to guard against concurrent changes.

It makes the code more complex and a bit slower, which is why I am 
hesistant. But I can implement that if you think that we need that. Or 
we just replace the sort, which is probably a good idea anyway.

Best

Thomas