From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <elfutils-devel-return-6191-listarch-elfutils-devel=sourceware.org@sourceware.org>
Received: (qmail 8073 invoked by alias); 13 Jun 2017 16:17:15 -0000
Mailing-List: contact elfutils-devel-help@sourceware.org; run by ezmlm
Precedence: bulk
List-Id: <elfutils-devel.sourceware.org>
List-Post: <mailto:elfutils-devel@sourceware.org>
List-Help: <mailto:elfutils-devel-help@sourceware.org>
List-Subscribe: <mailto:elfutils-devel-subscribe@sourceware.org>
Sender: elfutils-devel-owner@sourceware.org
Received: (qmail 8014 invoked by uid 89); 13 Jun 2017 16:17:14 -0000
Authentication-Results: sourceware.org; auth=none
X-Virus-Checked: by ClamAV 0.99.2 on sourceware.org
X-Virus-Found: No
X-Spam-SWARE-Status: No, score=-0.9 required=5.0 tests=BAYES_00,KAM_LAZY_DOMAIN_SECURITY autolearn=no version=3.3.2 spammy=H*M:252, retry, H*F:U*mark, our
X-Spam-Status: No, score=-0.9 required=5.0 tests=BAYES_00,KAM_LAZY_DOMAIN_SECURITY autolearn=no version=3.3.2
X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on sourceware.org
X-Spam-Level:
X-HELO: gnu.wildebeest.org
Received: from wildebeest.demon.nl (HELO gnu.wildebeest.org) (212.238.236.112) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP; Tue, 13 Jun 2017 16:17:13 +0000
Received: from tarox.wildebeest.org (tarox.wildebeest.org [172.31.17.39])	(using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits))	(No client certificate requested)	by gnu.wildebeest.org (Postfix) with ESMTPSA id D23AA3000AE0;	Tue, 13 Jun 2017 18:17:15 +0200 (CEST)
Received: by tarox.wildebeest.org (Postfix, from userid 1000)	id BCE3F4021689; Tue, 13 Jun 2017 18:17:15 +0200 (CEST)
Message-ID: <1497370635.3755.252.camel@klomp.org>
Subject: Re: How to debug broken unwinding?
From: Mark Wielaard <mark@klomp.org>
To: Milian Wolff <mail@milianw.de>
Cc: elfutils-devel@sourceware.org
Date: Wed, 14 Jun 2017 18:27:00 -0000
In-Reply-To: <44012995.nqRUhLey2r@agathebauer>
References: <2244170.M6KXmO4ZXp@agathebauer>	 <4288473.shuf9ekOC1@agathebauer> <44012995.nqRUhLey2r@agathebauer>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Mailer: Evolution 3.12.11 (3.12.11-22.el7)
Mime-Version: 1.0
X-IsSubscribed: yes
X-SW-Source: 2017-q2/txt/msg00247.txt.bz2

Hi Milian,

On Fri, 2017-06-02 at 17:03 +0200, Milian Wolff wrote:
> Some more debugging and going after my gut feeling brings me to the follo=
wing=20
> conclusion: The real issue seems to be the on-demand reporting of the elf=
=20
> file. We used to do:
>=20
>        Dwarf_Addr pc;
>        bool isactivation;
>=20=20
>        if (!dwfl_frame_pc(state, &pc, &isactivation)) {
>                pr_err("%s", dwfl_errmsg(-1));
>                return DWARF_CB_ABORT;
>        }
>=20
>        // report the module before we query for isactivation
>        report_module(pc, ui);
>=20
> This looks safe and fine and actually works most of the time. But passing=
 a=20
> non-null isactivation flag to dwfl_frame_pc potentially leads to a second=
=20
> unwind step, before we got the change to report the module! I can workaro=
und=20
> this by instead doing
>=20
>        Dwarf_Addr pc;
>        bool isactivation;
>=20=20
>        if (!dwfl_frame_pc(state, &pc, NULL)) {
>                pr_err("%s", dwfl_errmsg(-1));
>                return DWARF_CB_ABORT;
>        }
>=20
>        // report the module before we query for isactivation
>        report_module(pc, ui);
>=20
>        if (!dwfl_frame_pc(state, &pc, &isactivation)) {
>                pr_err("%s", dwfl_errmsg(-1));
>                return DWARF_CB_ABORT;
>        }
>=20
> This fixes all the issues in my original email. So sorry for the noise - =
it=20
> doesn't see as if the unwinding code in elfutils is broken - quite the=20
> contrary! It's just our misuse of the API that is to blame.

That is some very nice detective work.
The isactivation code is tricky. The logic is:

isactivation =3D (frame is initial_frame
                || frame is signal_frame
		|| unwind_frame is signal_frame)

The idea being that normally a frame isn't an activation.
So the given pc is really the return address, not the call address.
But if it is the initial frame or a frame caused by a signal, then the
pc really is where the code was when this frame was active.

The tricky case is that last case. If the previous frame is a signal
frame then this frame is also an activation (so you don't need to
subtract 1 to get the actual address that was executing). I am not sure
I fully understand anymore why that is. This must be a frame that is
itself not an initial or signal frame, but that is called (activated)
from a signal frame.

But assuming that is the correct semantics then your original
observation seems to explain what is going on. The code doesn't account
for the fact that during an unwind other modules can be reported that
might make it able to unwind the current frame. And it caches the
failure state of the unwind of the current frame, so it doesn't retry
when asked.

I think we could reset the state of the current frame if any more
modules are reported. But is it possible for your parser or perf to
report modules as soon as it knows about them, not lazily during an
unwind? Is reporting lazily at the last possible moment an optimization
that really helps? I would assume that you end up reporting all modules
anyway. So it actually seems easier to report everything upfront.

> May I suggest though to move the isactivation code into a separate functi=
on to=20
> prevent this issue from arising in the future? I.e. it would be nice if t=
he=20
> code above could read:
>=20
>=20
>        Dwarf_Addr pc;
>        bool isactivation;
>=20=20
>        if (!dwfl_frame_pc(state, &pc)) {
>                pr_err("%s", dwfl_errmsg(-1));
>                return DWARF_CB_ABORT;
>        }
>=20
>        // report the module before we query for isactivation
>        report_module(pc, ui);
>=20
>        if (!dwfl_frame_is_activation(state)) {
>            --pc;=20
>        }

That is a good suggestion in any case. Maybe also introduce
dwfl_frame_is_signal_frame () and dwfl_frame_is_initial_frame () in case
the user wants to know the exact details.

Thanks,

Mark