To: gcc@gcc.gnu.org
From: Thomas Neumann
Subject: performance of exception handling
Date: Mon, 11 May 2020 10:14:36 +0200
Message-ID: <0bbdaab7-c083-e14e-6227-27713dab9657@users.sourceforge.net>

Hi,

I want to improve the performance of C++ exception handling, and I would like to get some feedback on how to tackle that.

Currently, exception handling scales poorly due to global mutexes when throwing. This can be seen with a small demo script here: https://repl.it/repls/DeliriousPrivateProfiler Using a thread count >1 is much slower than running single threaded.
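For readers who cannot open the link, a measurement of this kind might look roughly like the following. This is only a minimal sketch and not the linked demo itself; the function names `count_caught_exceptions` and `seconds_to_run` are made up for illustration.

```cpp
#include <atomic>
#include <chrono>
#include <cstddef>
#include <stdexcept>
#include <thread>
#include <vector>

// Hypothetical helper (not taken from the linked demo): every thread throws
// and catches `throws_per_thread` exceptions. Returns the total number caught.
std::size_t count_caught_exceptions(unsigned thread_count,
                                    std::size_t throws_per_thread) {
    std::atomic<std::size_t> caught{0};
    std::vector<std::thread> workers;
    workers.reserve(thread_count);
    for (unsigned t = 0; t < thread_count; ++t) {
        workers.emplace_back([&caught, throws_per_thread] {
            std::size_t local = 0;
            for (std::size_t i = 0; i < throws_per_thread; ++i) {
                try {
                    throw std::runtime_error("demo");
                } catch (const std::runtime_error&) {
                    ++local;  // each throw goes through the unwinder
                }
            }
            caught.fetch_add(local, std::memory_order_relaxed);
        });
    }
    for (auto& worker : workers) worker.join();
    return caught.load();
}

// Hypothetical helper: wall-clock seconds for one run of the loop above.
double seconds_to_run(unsigned thread_count, std::size_t throws_per_thread) {
    const auto start = std::chrono::steady_clock::now();
    count_caught_exceptions(thread_count, throws_per_thread);
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count();
}
```

Each thread does the same amount of work regardless of the thread count, so with perfect scaling `seconds_to_run(1, n)` and `seconds_to_run(k, n)` would be close to each other on a machine with at least k cores. Any extra time for k > 1 is a rough indication of contention.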
This global locking is particularly painful on a machine with more than a hundred cores, as mutexes are expensive there and contention becomes much more likely due to the high degree of parallelism.

Of course, conventional wisdom is not to use exceptions when exceptions can occur somewhat frequently. But I think that is a silly argument, see the WG21 paper P0709 for a detailed discussion. In particular, there is no technical reason why they have to be slow; it is just the current implementation that is slow.

In the current gcc implementation on Linux the bottleneck is _Unwind_Find_FDE, or more precisely, the function dl_iterate_phdr, which is called for every frame and which iterates over all shared libraries while holding a global lock. That is inherently slow, both due to global locking and due to the data structures involved. And it is not easy to speed that up with, e.g., a thread-local cache, as glibc has no mechanism to notify us if a shared library is added or removed.

We therefore need a way to locate the exception frames that is independent of glibc. One way to achieve that would be to explicitly register exception frames with __register_frame_info_bases in a constructor function (and deregister them in a destructor function). Of course, probing explicitly registered frames currently uses a global lock, too, but that implementation is provided by libgcc, and we can change that to something better, allowing for lock-free reads.

In libgcc, explicitly registered frames take precedence over the dl_iterate_phdr mechanism, which means that we could mix future code that does call __register_frame_info_bases explicitly with code that does not. Code that does register will unwind faster than code that does not, but both can coexist in one process.

Does that sound like a viable strategy to speed up exception handling? I would be willing to contribute code for that, but I first wanted to know if you are interested and if the strategy makes sense.
Also, my implementation makes use of atomics, which I hope are available on all platforms that use unwind-dw2-fde.c, but I am not sure.

Thomas
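
P.S. To make the idea of lock-free reads over registered frames a bit more concrete, here is a minimal sketch. It is neither libgcc code nor the actual implementation: `FrameRange`, `FrameRegistry`, `add`, `remove` and `find` are hypothetical names, and it assumes that writers (registration and deregistration) are rare and may take a mutex, while readers (the unwinder) only do one atomic load. Old snapshots are deliberately kept alive until the registry is destroyed, which sidesteps the memory reclamation problem that a real implementation would have to solve.

```cpp
#include <algorithm>
#include <atomic>
#include <cstdint>
#include <memory>
#include <mutex>
#include <vector>

// Hypothetical description of one registered code range.
struct FrameRange {
    std::uintptr_t begin;  // first code address covered
    std::uintptr_t end;    // one past the last code address covered
    const void* eh_frame;  // unwind data registered for this range
};

class FrameRegistry {
    using Snapshot = std::vector<FrameRange>;  // sorted by `begin`

public:
    FrameRegistry() { publish(std::make_unique<Snapshot>()); }

    // Writer: copy the latest snapshot, insert in sorted order, publish.
    void add(const FrameRange& range) {
        std::lock_guard<std::mutex> guard(write_mutex_);
        auto next = std::make_unique<Snapshot>(*snapshots_.back());
        auto pos = std::upper_bound(
            next->begin(), next->end(), range,
            [](const FrameRange& a, const FrameRange& b) {
                return a.begin < b.begin;
            });
        next->insert(pos, range);
        publish(std::move(next));
    }

    // Writer: copy the latest snapshot without the given range, publish.
    void remove(std::uintptr_t begin) {
        std::lock_guard<std::mutex> guard(write_mutex_);
        auto next = std::make_unique<Snapshot>(*snapshots_.back());
        next->erase(std::remove_if(next->begin(), next->end(),
                                   [begin](const FrameRange& r) {
                                       return r.begin == begin;
                                   }),
                    next->end());
        publish(std::move(next));
    }

    // Reader: no lock, one acquire load followed by a binary search.
    const void* find(std::uintptr_t pc) const {
        const Snapshot* snap = current_.load(std::memory_order_acquire);
        auto it = std::upper_bound(
            snap->begin(), snap->end(), pc,
            [](std::uintptr_t value, const FrameRange& r) {
                return value < r.begin;
            });
        if (it == snap->begin()) return nullptr;
        --it;
        return pc < it->end ? it->eh_frame : nullptr;
    }

private:
    // Make `next` visible to readers and keep it alive (simplification:
    // superseded snapshots are never freed before the registry itself).
    void publish(std::unique_ptr<Snapshot> next) {
        current_.store(next.get(), std::memory_order_release);
        snapshots_.push_back(std::move(next));
    }

    std::mutex write_mutex_;
    std::vector<std::unique_ptr<Snapshot>> snapshots_;
    std::atomic<const Snapshot*> current_{nullptr};
};
```

A constructor function of a shared library would call something like `add` with its own code range, and the matching destructor function would call `remove`. The copy per registration makes writes expensive, which seems acceptable only under the assumption that libraries are loaded and unloaded far less often than exceptions are thrown.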