From: David Mosberger
Message-ID: <16490.8972.453326.222897@napali.hpl.hp.com>
Date: Wed, 31 Mar 2004 08:53:00 -0000
To: libc-hacker@sources.redhat.com
Cc: davidm@napali.hpl.hp.com
Subject: second thoughts on using dl_iterate_phdr() for cache-validation
Reply-To: davidm@hpl.hp.com

Doing some more careful performance analysis with libunwind, I'm finding
that the dl_iterate_phdr() call needed to verify that the phdr list didn't
change is rather expensive.  Specifically, the time needed to initialize an
"unwind cursor" (via unw_init_local()) is as follows:

  without dl_iterate_phdr() callback (unsafe):  140 ns
  with dl_iterate_phdr(), without -lpthread:    200 ns
  with dl_iterate_phdr(), with -lpthread:       300 ns

This is rather more than I expected, and a slow-down of more than a factor
of two for multi-threaded apps is a bit more than I'm willing to bear, since
it could really affect the usability of libunwind for things such as
allocation tracking or stack-trace-based sampling.

The profile for the "without -lpthread" case looks like this:

 % time   self  cumul  calls  self/call  tot/call  name
  60.44  13.05  13.05   101M       129n      204n  _ULia64_init_local
  17.19   3.71  16.76  99.6M      37.3n     66.7n  dl_iterate_phdr
   4.65   1.00  17.76   101M      9.97n     9.97n  rtld_lock_default_lock_recursive
   4.64   1.00  18.77  99.0M      10.1n     10.1n  rtld_lock_default_unlock_recursive

The profile for the "with -lpthread" case looks like this (it was measured
on a different machine, so the total time of 223 ns is not comparable to the
300 ns mentioned above; the relative times are fine, though):

 % time   self  cumul  calls  self/call  tot/call  name
  47.93  11.25  11.25  99.6M       113n      223n  _ULia64_init_local
  18.35   4.31  15.56   100M      43.0n     43.0n  pthread_mutex_lock
  11.81   2.77  18.33   100M      27.7n     27.7n  __pthread_mutex_unlock_usercnt
  11.65   2.73  21.06   100M      27.3n      103n  __GI___dl_iterate_phdr

For brevity, I didn't include the call graphs, but they are simple: all
calls to dl_iterate_phdr() are made indirectly by the cache-validation done
in _ULia64_init_local(), and almost all lock-related calls are due to
dl_iterate_phdr().

I suppose I could add a libunwind hack to disable cache-validation, but that
seems like a step backward, since it would make caching unsafe again.
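For concreteness, here is a minimal sketch of the kind of phdr-list check
this cache-validation implies.  The signature/checksum scheme and all names
below are illustrative only (they are not libunwind's actual internals);
the only real API assumed is glibc's documented dl_iterate_phdr():

/* Illustrative sketch: detect changes to the set of loaded objects by
   walking the phdr list and comparing a cheap signature against a cached
   one.  Names here are hypothetical, not libunwind internals.  */
#define _GNU_SOURCE
#include <link.h>
#include <stddef.h>

struct phdr_signature
  {
    size_t count;		/* number of objects seen */
    unsigned long checksum;	/* cheap hash over load addresses */
  };

static int
signature_callback (struct dl_phdr_info *info, size_t size, void *data)
{
  struct phdr_signature *sig = data;

  (void) size;
  sig->count++;
  sig->checksum ^= (unsigned long) info->dlpi_addr + info->dlpi_phnum;
  return 0;			/* keep iterating over all objects */
}

/* Return nonzero if the phdr list appears unchanged since *CACHED was
   taken; otherwise refresh *CACHED so the caller can flush its cache.  */
static int
phdr_cache_still_valid (struct phdr_signature *cached)
{
  struct phdr_signature cur = { 0, 0 };

  /* This walk takes the loader lock internally, which is where the
     rtld lock / pthread_mutex time in the profiles above comes from.  */
  dl_iterate_phdr (signature_callback, &cur);

  if (cur.count == cached->count && cur.checksum == cached->checksum)
    return 1;

  *cached = cur;
  return 0;
}

Every unw_init_local() that validates this way pays for one full walk of the
phdr list plus a lock/unlock pair, which is consistent with the per-call
numbers in the profiles above.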
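And for reference, per-call numbers like those above can be obtained with a
loop as simple as the following.  This is a generic sketch against
libunwind's documented local-unwinding API, not the harness actually used
for the measurements:

/* Generic timing sketch, not the actual measurement harness: time
   unw_init_local() in a tight loop.  Build with -lunwind (and -lrt on
   older systems for clock_gettime).  */
#define UNW_LOCAL_ONLY
#include <libunwind.h>
#include <stdio.h>
#include <time.h>

int
main (void)
{
  unw_context_t uc;
  unw_cursor_t cursor;
  struct timespec t0, t1;
  const long iters = 10000000;

  clock_gettime (CLOCK_MONOTONIC, &t0);
  for (long i = 0; i < iters; i++)
    {
      unw_getcontext (&uc);		/* capture current machine state */
      unw_init_local (&cursor, &uc);	/* the call being measured */
    }
  clock_gettime (CLOCK_MONOTONIC, &t1);

  printf ("%.0f ns per iteration (includes unw_getcontext overhead)\n",
	  ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec)) / iters);
  return 0;
}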
In case it matters, the first profile was obtained with libc v2.3.2 and the
second with the CVS libc (as of a few days ago).

Can this be improved?

	--david