Re: [PATCH] libdw: pre-compute leb128 loop limits

public inbox for elfutils@sourceware.org
 help / color / mirror / Atom feed

* Re: [PATCH] libdw: pre-compute leb128 loop limits
@ 2014-12-16 13:15 Mark Wielaard
  0 siblings, 0 replies; 5+ messages in thread
From: Mark Wielaard @ 2014-12-16 13:15 UTC (permalink / raw)
  To: elfutils-devel

[-- Attachment #1: Type: text/plain, Size: 4162 bytes --]

On Mon, 2014-12-15 at 23:03 +0100, Mark Wielaard wrote:
> On Mon, 2014-12-15 at 22:42 +0100, Mark Wielaard wrote:
> > On Mon, 2014-12-15 at 12:18 -0800, Josh Stone wrote:
> > > On Fedora 21, this appears to be slightly faster, although pretty close
> > > to noise levels.  Mark, can you see if this helps the performance slip
> > > on your el7 system?
> > 
> > It is slightly faster ~0.5 secs on ~55 secs.
> 
> Wait, I wasn't testing on an idle system. One of the cores was pretty
> busy (with running a fuzzer...). I retested both the original
> (mjw/pending) and your patch with nothing else eating cpu. Now (best of
> 3) original was 54.90 vs patched 53.90. So a whole second won.
> 
> > >    /* Unrolling 0 like uleb128 didn't prove to benefit optimization.  */
> > > -  for (unsigned int i = 0; i < len_leb128 (acc) && *addrp < end; ++i)
> > > +  const size_t max = __libdw_max_len_leb128 (*addrp, end);
> > > +  for (size_t i = 0; i < max; ++i)
> > >      get_sleb128_step (acc, *addrp, i);
> > >    /* Other implementations set VALUE to INT_MAX in this
> > >       case.  So we better do this as well.  */
> > 
> > Unrolling this does seem to give an addition ~0.2 seconds win.
> 
> Adding unrolling now (same idle system, best out of 3) gives me 54.28.
> So it does seem like another slight 0.6 second win.

I retested again and at least for my rhel7 setup both patches seem to
bring real wins. For my f21 setup things aren't so clear, the wins are
less than 0.5 seconds for both when taking the best of 3 runs. But I
don't fully trust my benchmarking on the f21 machine setup since some of
those runs show differences of several seconds with the worse being
70.62 seconds, and the best 66.40 seconds (without patches applied). So
something feels fishy about that machine.

But when everything is applied on average the new code is as fast as
0.160. Best run for 0.160 was 66.54, for an git master plus extra leb128
checking and performance tuning patches it was 66.03. For my rhel7 setup
best run for 0.160 was 53.63 seconds and with patches 53.93.

So at least compared to 0.160 we are not slower (although 0.160 was
slower than 0.158). But we do have a lot more robustness checking.

I would like to push the following patches, currently on mjw/pending, to
master:

commit 0f512f1201dc606fb1793f4a596d6b773033b10e
Author: Mark Wielaard <mjw@redhat.com>
Date:   Tue Dec 16 10:53:22 2014 +0100

    libdw: Unroll the first get_sleb128 step to help the compiler optimize.
    
    The common case is a single-byte. So no extra (max len) calculation is
    necessary then.
    
    Signed-off-by: Mark Wielaard <mjw@redhat.com>

commit 56698acce00bd0cbc8ace327263dfe147fae18fa
Author: Josh Stone <jistone@redhat.com>
Date:   Mon Dec 15 12:18:25 2014 -0800

    libdw: pre-compute leb128 loop limits
    
    Signed-off-by: Josh Stone <jistone@redhat.com>

commit 1df99d104efaa7d0b824f0761493010027a7303e
Author: Mark Wielaard <mjw@redhat.com>
Date:   Sun Dec 14 21:48:23 2014 +0100

    libdw: Add get_uleb128 and get_sleb128 bounds checking.
    
    Both get_uleb128 and get_sleb128 now take an end pointer to prevent
    reading too much data. Adjust all callers to provide the end pointer.
    
    There are still two exceptions. "Raw" dwarf_getabbrevattr and
    read_encoded_valued don't have a end pointer associated yet.
    They will have to be provided in the future.
    
    Signed-off-by: Mark Wielaard <mjw@redhat.com>

commit 1e777486d0c0191c7cc3adc29738e51f348cf039
Author: Mark Wielaard <mjw@redhat.com>
Date:   Fri Dec 12 16:43:04 2014 +0100

    libdw: Make sure all attributes come with a (fake) CU for bound checks.
    
    All attributes now have a reference to a (fake) CU that has startp and
    endp set to the data section where the form data comes from. Use that
    for bounds checking in __libdw_form_val_len and dwarf_formblock to make
    sure data read doesn't overflow any data section. Remove libdwP.h cu_data
    and use cu startp and endp directly where appropriate.
    
    Signed-off-by: Mark Wielaard <mjw@redhat.com>


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] libdw: pre-compute leb128 loop limits
@ 2014-12-17 15:37 Mark Wielaard
  0 siblings, 0 replies; 5+ messages in thread
From: Mark Wielaard @ 2014-12-17 15:37 UTC (permalink / raw)
  To: elfutils-devel

[-- Attachment #1: Type: text/plain, Size: 2295 bytes --]

On Tue, 2014-12-16 at 14:15 +0100, Mark Wielaard wrote:
> So at least compared to 0.160 we are not slower (although 0.160 was
> slower than 0.158). But we do have a lot more robustness checking.
> 
> I would like to push the following patches, currently on mjw/pending, to
> master:
> 
> commit 0f512f1201dc606fb1793f4a596d6b773033b10e
> Author: Mark Wielaard <mjw@redhat.com>
> Date:   Tue Dec 16 10:53:22 2014 +0100
> 
>     libdw: Unroll the first get_sleb128 step to help the compiler optimize.
>     
>     The common case is a single-byte. So no extra (max len) calculation is
>     necessary then.
>     
>     Signed-off-by: Mark Wielaard <mjw@redhat.com>
> 
> commit 56698acce00bd0cbc8ace327263dfe147fae18fa
> Author: Josh Stone <jistone@redhat.com>
> Date:   Mon Dec 15 12:18:25 2014 -0800
> 
>     libdw: pre-compute leb128 loop limits
>     
>     Signed-off-by: Josh Stone <jistone@redhat.com>
> 
> commit 1df99d104efaa7d0b824f0761493010027a7303e
> Author: Mark Wielaard <mjw@redhat.com>
> Date:   Sun Dec 14 21:48:23 2014 +0100
> 
>     libdw: Add get_uleb128 and get_sleb128 bounds checking.
>     
>     Both get_uleb128 and get_sleb128 now take an end pointer to prevent
>     reading too much data. Adjust all callers to provide the end pointer.
>     
>     There are still two exceptions. "Raw" dwarf_getabbrevattr and
>     read_encoded_valued don't have a end pointer associated yet.
>     They will have to be provided in the future.
>     
>     Signed-off-by: Mark Wielaard <mjw@redhat.com>
> 
> commit 1e777486d0c0191c7cc3adc29738e51f348cf039
> Author: Mark Wielaard <mjw@redhat.com>
> Date:   Fri Dec 12 16:43:04 2014 +0100
> 
>     libdw: Make sure all attributes come with a (fake) CU for bound checks.
>     
>     All attributes now have a reference to a (fake) CU that has startp and
>     endp set to the data section where the form data comes from. Use that
>     for bounds checking in __libdw_form_val_len and dwarf_formblock to make
>     sure data read doesn't overflow any data section. Remove libdwP.h cu_data
>     and use cu startp and endp directly where appropriate.
>     
>     Signed-off-by: Mark Wielaard <mjw@redhat.com>

I just did push all the above patches to master.

Cheers,

Mark

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] libdw: pre-compute leb128 loop limits
@ 2014-12-15 22:03 Mark Wielaard
  0 siblings, 0 replies; 5+ messages in thread
From: Mark Wielaard @ 2014-12-15 22:03 UTC (permalink / raw)
  To: elfutils-devel

[-- Attachment #1: Type: text/plain, Size: 1253 bytes --]

On Mon, 2014-12-15 at 22:42 +0100, Mark Wielaard wrote:
> On Mon, 2014-12-15 at 12:18 -0800, Josh Stone wrote:
> > On Fedora 21, this appears to be slightly faster, although pretty close
> > to noise levels.  Mark, can you see if this helps the performance slip
> > on your el7 system?
> 
> It is slightly faster ~0.5 secs on ~55 secs.

Wait, I wasn't testing on an idle system. One of the cores was pretty
busy (with running a fuzzer...). I retested both the original
(mjw/pending) and your patch with nothing else eating cpu. Now (best of
3) original was 54.90 vs patched 53.90. So a whole second won.

> >    /* Unrolling 0 like uleb128 didn't prove to benefit optimization.  */
> > -  for (unsigned int i = 0; i < len_leb128 (acc) && *addrp < end; ++i)
> > +  const size_t max = __libdw_max_len_leb128 (*addrp, end);
> > +  for (size_t i = 0; i < max; ++i)
> >      get_sleb128_step (acc, *addrp, i);
> >    /* Other implementations set VALUE to INT_MAX in this
> >       case.  So we better do this as well.  */
> 
> Unrolling this does seem to give an addition ~0.2 seconds win.

Adding unrolling now (same idle system, best out of 3) gives me 54.28.
So it does seem like another slight 0.6 second win.

Cheers,

Mark

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] libdw: pre-compute leb128 loop limits
@ 2014-12-15 21:42 Mark Wielaard
  0 siblings, 0 replies; 5+ messages in thread
From: Mark Wielaard @ 2014-12-15 21:42 UTC (permalink / raw)
  To: elfutils-devel

[-- Attachment #1: Type: text/plain, Size: 874 bytes --]

On Mon, 2014-12-15 at 12:18 -0800, Josh Stone wrote:
> On Fedora 21, this appears to be slightly faster, although pretty close
> to noise levels.  Mark, can you see if this helps the performance slip
> on your el7 system?

It is slightly faster ~0.5 secs on ~55 secs.

>    /* Unrolling 0 like uleb128 didn't prove to benefit optimization.  */
> -  for (unsigned int i = 0; i < len_leb128 (acc) && *addrp < end; ++i)
> +  const size_t max = __libdw_max_len_leb128 (*addrp, end);
> +  for (size_t i = 0; i < max; ++i)
>      get_sleb128_step (acc, *addrp, i);
>    /* Other implementations set VALUE to INT_MAX in this
>       case.  So we better do this as well.  */

Unrolling this does seem to give an addition ~0.2 seconds win.
So both are mostly in the noise, but they do seem to help a little.

But not as much as upgrading GCC :)

Cheers,

Mark

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [PATCH] libdw: pre-compute leb128 loop limits
@ 2014-12-15 20:18 Josh Stone
  0 siblings, 0 replies; 5+ messages in thread
From: Josh Stone @ 2014-12-15 20:18 UTC (permalink / raw)
  To: elfutils-devel

[-- Attachment #1: Type: text/plain, Size: 2676 bytes --]

Signed-off-by: Josh Stone <jistone@redhat.com>
---

On Fedora 21, this appears to be slightly faster, although pretty close
to noise levels.  Mark, can you see if this helps the performance slip
on your el7 system?

---
 libdw/ChangeLog       |  6 ++++++
 libdw/memory-access.h | 17 +++++++++++++++--
 2 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/libdw/ChangeLog b/libdw/ChangeLog
index 2aa878bd8ad7..17b4db980278 100644
--- a/libdw/ChangeLog
+++ b/libdw/ChangeLog
@@ -1,3 +1,9 @@
+2014-12-15  Josh Stone  <jistone@redhat.com>
+
+	* memory-access.h (__libdw_max_len_leb128): New.
+	(__libdw_get_uleb128): Use __libdw_max_len_leb128.
+	(__libdw_get_sleb128): Likewise.
+
 2014-12-15  Mark Wielaard  <mjw@redhat.com>
 
 	* dwarf_getpubnames.c (get_offsets): Make sure whole unit fall inside
diff --git a/libdw/memory-access.h b/libdw/memory-access.h
index 8226d00e9b33..99c827af22bd 100644
--- a/libdw/memory-access.h
+++ b/libdw/memory-access.h
@@ -39,6 +39,14 @@
 
 #define len_leb128(var) ((8 * sizeof (var) + 6) / 7)
 
+static inline size_t
+__libdw_max_len_leb128 (const unsigned char *addr, const unsigned char *end)
+{
+  const size_t type_len = len_leb128 (uint64_t);
+  const size_t pointer_len = likely (addr < end) ? end - addr : 0;
+  return likely (type_len <= pointer_len) ? type_len : pointer_len;
+}
+
 #define get_uleb128_step(var, addr, nth)				      \
   do {									      \
     unsigned char __b = *(addr)++;					      \
@@ -51,10 +59,13 @@ static inline uint64_t
 __libdw_get_uleb128 (const unsigned char **addrp, const unsigned char *end)
 {
   uint64_t acc = 0;
+
   /* Unroll the first step to help the compiler optimize
      for the common single-byte case.  */
   get_uleb128_step (acc, *addrp, 0);
-  for (unsigned int i = 1; i < len_leb128 (acc) && *addrp < end; ++i)
+
+  const size_t max = __libdw_max_len_leb128 (*addrp - 1, end);
+  for (size_t i = 1; i < max; ++i)
     get_uleb128_step (acc, *addrp, i);
   /* Other implementations set VALUE to UINT_MAX in this
      case.  So we better do this as well.  */
@@ -82,8 +93,10 @@ static inline int64_t
 __libdw_get_sleb128 (const unsigned char **addrp, const unsigned char *end)
 {
   int64_t acc = 0;
+
   /* Unrolling 0 like uleb128 didn't prove to benefit optimization.  */
-  for (unsigned int i = 0; i < len_leb128 (acc) && *addrp < end; ++i)
+  const size_t max = __libdw_max_len_leb128 (*addrp, end);
+  for (size_t i = 0; i < max; ++i)
     get_sleb128_step (acc, *addrp, i);
   /* Other implementations set VALUE to INT_MAX in this
      case.  So we better do this as well.  */
-- 
2.1.0


^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2014-12-17 15:37 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-12-16 13:15 [PATCH] libdw: pre-compute leb128 loop limits Mark Wielaard
  -- strict thread matches above, loose matches on Subject: below --
2014-12-17 15:37 Mark Wielaard
2014-12-15 22:03 Mark Wielaard
2014-12-15 21:42 Mark Wielaard
2014-12-15 20:18 Josh Stone

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).