From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Tue, 03 Mar 2020 18:39:00 -0000
From: Simon Richter <Simon.Richter@hogyros.de>
To: binutils@sourceware.org
Subject: Token-level mapping of coverage information and generated code
Message-ID: <20200303183928.GA8300@psi5.com>

Hi,

I'd like to get finer-than-line-level information for code coverage and
optimized-out code.

Consider:

    extern void foo(void);                              // 1
    int test()                                          // 2
    {                                                   // 3
        int a = 0, b = 0, c = 1, d = 0;                 // 4
        if( a == b && a == c && b == c) { d = a; }      // 5
        foo();                                          // 6
        return d;                                       // 7
    }                                                   // 8

Compiling with gcc -c -O0 and mapping back to source, I get

    int test()
    {
       0:   55                      push   %rbp
       1:   48 89 e5                mov    %rsp,%rbp
       4:   48 83 ec 10             sub    $0x10,%rsp
        int a = 0, b = 0, c = 1, d = 0;
       8:   c7 45 f8 00 00 00 00    movl   $0x0,-0x8(%rbp)
       f:   c7 45 f4 00 00 00 00    movl   $0x0,-0xc(%rbp)
      16:   c7 45 f0 01 00 00 00    movl   $0x1,-0x10(%rbp)
      1d:   c7 45 fc 00 00 00 00    movl   $0x0,-0x4(%rbp)
    
        if( a == b && a == c && b == c) { d = a; }
      24:   8b 45 f8                mov    -0x8(%rbp),%eax
      27:   3b 45 f4                cmp    -0xc(%rbp),%eax
      2a:   75 16                   jne    42 <test+0x42>
      2c:   8b 45 f8                mov    -0x8(%rbp),%eax
      2f:   3b 45 f0                cmp    -0x10(%rbp),%eax
      32:   75 0e                   jne    42 <test+0x42>
      34:   8b 45 f4                mov    -0xc(%rbp),%eax
      37:   3b 45 f0                cmp    -0x10(%rbp),%eax
      3a:   75 06                   jne    42 <test+0x42>
      3c:   8b 45 f8                mov    -0x8(%rbp),%eax
      3f:   89 45 fc                mov    %eax,-0x4(%rbp)
    
        foo();
      42:   e8 00 00 00 00          callq  47 <test+0x47>
                            43: R_X86_64_PLT32      foo-0x4
    
        return d;
      47:   8b 45 fc                mov    -0x4(%rbp),%eax
    }
      4a:   c9                      leaveq 
      4b:   c3                      retq   

The finest resolution I can get here is a single line; addr2line reports
the exact same instruction-to-source-line mapping.

Instrumenting for code coverage and running, I get
    
            1:    2:int test()
            -:    3:{
            1:    4:    int a = 0, b = 0, c = 1, d = 0;
           1*:    5:    if( a == b && a == c && b == c) { d = a; }
            1:    5-block  0
            1:    5-block  1
        %%%%%:    5-block  2
        %%%%%:    5-block  3
            1:    6:    foo();
            1:    6-block  0
            1:    7:    return d;
            -:    8:}

As expected, the condition is resolved into four basic blocks,
corresponding to the three tests and the conditional body. Can I somehow
map these basic blocks back to the tokens in the source file?
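
For illustration, a minimal sketch of pulling the per-block counters out of
a listing like the one above. It assumes the text layout shown there (count,
colon, line number, then "-block N"); the function name and the shape of the
returned dictionary are made up for this example, and it does nothing to map
blocks to tokens.

```python
import re

# Matches rows such as "        1:    5-block  1" or "    %%%%%:    5-block  2".
BLOCK_RE = re.compile(r"^\s*(\S+):\s*(\d+)-block\s+(\d+)\s*$")

def parse_block_counts(gcov_text):
    """Return {source_line: {block_number: execution_count}}."""
    blocks = {}
    for row in gcov_text.splitlines():
        m = BLOCK_RE.match(row)
        if not m:
            continue  # ordinary source rows and anything else are skipped
        count_field, lineno, block = m.groups()
        digits = count_field.rstrip("*")
        # Non-numeric markers (e.g. "%%%%%" or "#####") are treated as zero.
        count = int(digits) if digits.isdigit() else 0
        blocks.setdefault(int(lineno), {})[int(block)] = count
    return blocks

# Rows copied from the listing above.
SAMPLE = """\
       1*:    5:    if( a == b && a == c && b == c) { d = a; }
        1:    5-block  0
        1:    5-block  1
    %%%%%:    5-block  2
    %%%%%:    5-block  3
        1:    6:    foo();
        1:    6-block  0
"""

if __name__ == "__main__":
    print(parse_block_counts(SAMPLE))
```

This only recovers how often each block on a line ran; which tokens a block
corresponds to is still the open question.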

Similarly, if I compile with optimization enabled, mapping back to source
code gives me

    int test()
    {
       0:   48 83 ec 08             sub    $0x8,%rsp
        int a = 0, b = 0, c = 1, d = 0;
        if( a == b && a == c && b == c) { d = a; }
        foo();
       4:   e8 00 00 00 00          callq  9 <test+0x9>
                            5: R_X86_64_PLT32       foo-0x4
        return d;
    }
       9:   31 c0                   xor    %eax,%eax
       b:   48 83 c4 08             add    $0x8,%rsp
       f:   c3                      retq   

I can get a bit better mapping information by interrogating addr2line to
see what source code lines actually contributed to the output:

    $ python3 -c 'for x in range(0, 16): print(hex(x))' | \
        addr2line -e test.o | \
        cut -d: -f2 | \
        uniq
    3
    6
    8

This omits the initialization of d, but I guess that can't be helped, since
the value is propagated into the return statement as a constant; that is
probably not much of a problem in the real world.
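
The same query can be sketched in Python, which makes it easier to
post-process the result. This assumes addr2line is on the PATH and prints one
"file:line" row per address; the helper names are hypothetical.

```python
import itertools
import subprocess

def run_addr2line(obj_path, n_bytes):
    """Ask addr2line about every byte offset in [0, n_bytes)."""
    addrs = [hex(a) for a in range(n_bytes)]
    result = subprocess.run(
        ["addr2line", "-e", obj_path, *addrs],
        check=True, capture_output=True, text=True,
    )
    return result.stdout

def contributing_lines(addr2line_output):
    """Equivalent of `cut -d: -f2 | uniq` on "file:line" rows."""
    fields = [row.split(":")[1]
              for row in addr2line_output.splitlines() if ":" in row]
    # groupby collapses only consecutive duplicates, like uniq does.
    return [key for key, _group in itertools.groupby(fields)]

if __name__ == "__main__":
    # test.o and the 16-byte size are taken from the example above.
    print(contributing_lines(run_addr2line("test.o", 16)))
```

Keeping the parsing separate from the subprocess call means it can be
exercised on captured output without the object file at hand.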

Again, I'd like to get a finer-grained mapping than lines here, so I can
highlight in the source code which code actually got used in the final
output.

As a nasty hack, I can run the source code through "tr ' ' '\n'" before
compiling, which gives me rather good resolution for the coverage test, but
the mapping to subexpressions is somewhat arbitrary, because counters are
associated with control flow inside the expression:

            1:   28:if(
            -:   29:a
            -:   30:==
            -:   31:b
            1:   32:&&
            -:   33:a
            -:   34:==
            -:   35:c
        #####:   36:&&
            -:   37:b
            -:   38:==
            -:   39:c)
            -:   40:{
            -:   41:d
        #####:   42:=
            -:   43:a;
            -:   44:}
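
To carry the counters from the split file back to the original text, the
split can record where each generated line came from. A minimal sketch,
assuming the split is exactly what tr ' ' '\n' does (so it has the same
problems with spaces inside string literals or preprocessor directives); the
function name is made up for illustration.

```python
def split_with_origin(source):
    """Mimic tr ' ' '\\n' and remember where each output line came from.

    Returns (new_source, origin), where origin[i] is the 1-based
    (line, column) in the original text at which output line i + 1 starts.
    """
    out_lines, origin = [], []
    for lineno, line in enumerate(source.split("\n"), start=1):
        col = 1
        for piece in line.split(" "):
            out_lines.append(piece)
            origin.append((lineno, col))
            col += len(piece) + 1  # step over the piece and the space after it
    return "\n".join(out_lines), origin

if __name__ == "__main__":
    src = "if( a == b && a == c && b == c) { d = a; }"
    new_src, origin = split_with_origin(src)
    for text, (line, col) in zip(new_src.split("\n"), origin):
        print(f"{line}:{col}\t{text}")
```

A coverage count reported for generated line N then belongs to the token
starting at origin[N - 1] in the unmodified source.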

Is there some way I could accurately extract information from a run that
allows me to highlight which subexpressions have been evaluated?

From the run above, I can possibly get

        if( a == b && a == c && b == c) { d = a; }
        ~~~~~~~~~~ ~~~~~~~~~ ~~~~~~~~~~~~~~ ~~~~~~
        1          1         -              -

which isn't bad, but it could probably be improved. The end goal is to
build reports

    "this condition has not been touched by a testcase"
and
    "this code is unused and the compiler can prove it"

   Simon
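
For reference, the underlined annotation shown earlier can be rendered
mechanically once the starting column of each counter-carrying token is
known. A minimal sketch; the function name is hypothetical, and it assumes
each segment runs up to the single space before the next counter (or to the
end of the line).

```python
def annotate(source_line, counters):
    """Render a source line with one underlined, labelled segment per counter.

    counters is a list of (column, label) pairs sorted by column, where
    column is the 1-based position at which the counter's token starts.
    """
    underline = [" "] * len(source_line)
    labels = [" "] * len(source_line)
    for i, (col, label) in enumerate(counters):
        start = col - 1
        if i + 1 < len(counters):
            end = counters[i + 1][0] - 2  # stop before the separating space
        else:
            end = len(source_line)
        underline[start:end] = "~" * (end - start)
        labels[start:start + len(label)] = label
    return "\n".join(
        [source_line, "".join(underline), "".join(labels).rstrip()]
    )

if __name__ == "__main__":
    line = "if( a == b && a == c && b == c) { d = a; }"
    # Columns of "if(", the two "&&" and "=", with the counts from the run.
    print(annotate(line, [(1, "1"), (12, "1"), (22, "-"), (37, "-")]))
```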