From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (qmail 8073 invoked by alias); 13 Jun 2017 16:17:15 -0000 Mailing-List: contact elfutils-devel-help@sourceware.org; run by ezmlm Precedence: bulk List-Id: List-Post: List-Help: List-Subscribe: Sender: elfutils-devel-owner@sourceware.org Received: (qmail 8014 invoked by uid 89); 13 Jun 2017 16:17:14 -0000 Authentication-Results: sourceware.org; auth=none X-Virus-Checked: by ClamAV 0.99.2 on sourceware.org X-Virus-Found: No X-Spam-SWARE-Status: No, score=-0.9 required=5.0 tests=BAYES_00,KAM_LAZY_DOMAIN_SECURITY autolearn=no version=3.3.2 spammy=H*M:252, retry, H*F:U*mark, our X-Spam-Status: No, score=-0.9 required=5.0 tests=BAYES_00,KAM_LAZY_DOMAIN_SECURITY autolearn=no version=3.3.2 X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on sourceware.org X-Spam-Level: X-HELO: gnu.wildebeest.org Received: from wildebeest.demon.nl (HELO gnu.wildebeest.org) (212.238.236.112) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP; Tue, 13 Jun 2017 16:17:13 +0000 Received: from tarox.wildebeest.org (tarox.wildebeest.org [172.31.17.39]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by gnu.wildebeest.org (Postfix) with ESMTPSA id D23AA3000AE0; Tue, 13 Jun 2017 18:17:15 +0200 (CEST) Received: by tarox.wildebeest.org (Postfix, from userid 1000) id BCE3F4021689; Tue, 13 Jun 2017 18:17:15 +0200 (CEST) Message-ID: <1497370635.3755.252.camel@klomp.org> Subject: Re: How to debug broken unwinding? From: Mark Wielaard To: Milian Wolff Cc: elfutils-devel@sourceware.org Date: Wed, 14 Jun 2017 18:27:00 -0000 In-Reply-To: <44012995.nqRUhLey2r@agathebauer> References: <2244170.M6KXmO4ZXp@agathebauer> <4288473.shuf9ekOC1@agathebauer> <44012995.nqRUhLey2r@agathebauer> Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Mailer: Evolution 3.12.11 (3.12.11-22.el7) Mime-Version: 1.0 X-IsSubscribed: yes X-SW-Source: 2017-q2/txt/msg00247.txt.bz2 Hi Milian, On Fri, 2017-06-02 at 17:03 +0200, Milian Wolff wrote: > Some more debugging and going after my gut feeling brings me to the follo= wing=20 > conclusion: The real issue seems to be the on-demand reporting of the elf= =20 > file. We used to do: >=20 > Dwarf_Addr pc; > bool isactivation; >=20=20 > if (!dwfl_frame_pc(state, &pc, &isactivation)) { > pr_err("%s", dwfl_errmsg(-1)); > return DWARF_CB_ABORT; > } >=20 > // report the module before we query for isactivation > report_module(pc, ui); >=20 > This looks safe and fine and actually works most of the time. But passing= a=20 > non-null isactivation flag to dwfl_frame_pc potentially leads to a second= =20 > unwind step, before we got the change to report the module! I can workaro= und=20 > this by instead doing >=20 > Dwarf_Addr pc; > bool isactivation; >=20=20 > if (!dwfl_frame_pc(state, &pc, NULL)) { > pr_err("%s", dwfl_errmsg(-1)); > return DWARF_CB_ABORT; > } >=20 > // report the module before we query for isactivation > report_module(pc, ui); >=20 > if (!dwfl_frame_pc(state, &pc, &isactivation)) { > pr_err("%s", dwfl_errmsg(-1)); > return DWARF_CB_ABORT; > } >=20 > This fixes all the issues in my original email. So sorry for the noise - = it=20 > doesn't see as if the unwinding code in elfutils is broken - quite the=20 > contrary! It's just our misuse of the API that is to blame. That is some very nice detective work. The isactivation code is tricky. The logic is: isactivation =3D (frame is initial_frame || frame is signal_frame || unwind_frame is signal_frame) The idea being that normally a frame isn't an activation. So the given pc is really the return address, not the call address. But if it is the initial frame or a frame caused by a signal, then the pc really is where the code was when this frame was active. The tricky case is that last case. If the previous frame is a signal frame then this frame is also an activation (so you don't need to subtract 1 to get the actual address that was executing). I am not sure I fully understand anymore why that is. This must be a frame that is itself not an initial or signal frame, but that is called (activated) from a signal frame. But assuming that is the correct semantics then your original observation seems to explain what is going on. The code doesn't account for the fact that during an unwind other modules can be reported that might make it able to unwind the current frame. And it caches the failure state of the unwind of the current frame, so it doesn't retry when asked. I think we could reset the state of the current frame if any more modules are reported. But is it possible for your parser or perf to report modules as soon as it knows about them, not lazily during an unwind? Is reporting lazily at the last possible moment an optimization that really helps? I would assume that you end up reporting all modules anyway. So it actually seems easier to report everything upfront. > May I suggest though to move the isactivation code into a separate functi= on to=20 > prevent this issue from arising in the future? I.e. it would be nice if t= he=20 > code above could read: >=20 >=20 > Dwarf_Addr pc; > bool isactivation; >=20=20 > if (!dwfl_frame_pc(state, &pc)) { > pr_err("%s", dwfl_errmsg(-1)); > return DWARF_CB_ABORT; > } >=20 > // report the module before we query for isactivation > report_module(pc, ui); >=20 > if (!dwfl_frame_is_activation(state)) { > --pc;=20 > } That is a good suggestion in any case. Maybe also introduce dwfl_frame_is_signal_frame () and dwfl_frame_is_initial_frame () in case the user wants to know the exact details. Thanks, Mark