public inbox for libc-alpha@sourceware.org
* RE: [PATCH][PING] Improve stpncpy performance
@ 2015-07-06 11:29 Wilco Dijkstra
  2015-07-09 13:21 ` Ondřej Bílka
  0 siblings, 1 reply; 5+ messages in thread
From: Wilco Dijkstra @ 2015-07-06 11:29 UTC (permalink / raw)
  To: libc-alpha

> Wilco wrote:
> ping
> 
> > > -----Original Message-----
> > > From: Wilco Dijkstra [mailto:wdijkstr@arm.com]
> > > Sent: 12 January 2015 15:09
> > > To: 'libc-alpha@sourceware.org'
> > > Subject: [PATCH] Improve stpncpy performance
> > >
> > > Like strncpy, this patch improves stpncpy performance by using
> > > strnlen/memcpy/memset rather than a byte loop. Performance on
> > > bench-stpncpy is ~2x faster on average.
> > >
> > > ChangeLog:
> > >
> > > 2015-01-12  Wilco Dijkstra  <wdijkstr@arm.com>
> > >
> > > 	* string/stpncpy.c (stpncpy): Improve performance using
> > > 	__strnlen/memcpy/memset.
> > >
> > > ---
> > >  string/stpncpy.c | 59 ++++++--------------------------------------------------
> > >  1 file changed, 6 insertions(+), 53 deletions(-)
> > >
> > > diff --git a/string/stpncpy.c b/string/stpncpy.c
> > > index fad747e..50521aa 100644
> > > --- a/string/stpncpy.c
> > > +++ b/string/stpncpy.c
> > > @@ -15,7 +15,7 @@
> > >     License along with the GNU C Library; if not, see
> > >     <http://www.gnu.org/licenses/>.  */
> > >
> > > -/* This is almost copied from strncpy.c, written by Torbjorn Granlund.  */
> > > +#include <stddef.h>
> > >
> > >  #ifdef HAVE_CONFIG_H
> > >  # include <config.h>
> > > @@ -41,59 +41,12 @@ weak_alias (__stpncpy, stpncpy)
> > >  char *
> > >  STPNCPY (char *dest, const char *src, size_t n)
> > >  {
> > > -  char c;
> > > -  char *s = dest;
> > > -
> > > -  if (n >= 4)
> > > -    {
> > > -      size_t n4 = n >> 2;
> > > -
> > > -      for (;;)
> > > -	{
> > > -	  c = *src++;
> > > -	  *dest++ = c;
> > > -	  if (c == '\0')
> > > -	    break;
> > > -	  c = *src++;
> > > -	  *dest++ = c;
> > > -	  if (c == '\0')
> > > -	    break;
> > > -	  c = *src++;
> > > -	  *dest++ = c;
> > > -	  if (c == '\0')
> > > -	    break;
> > > -	  c = *src++;
> > > -	  *dest++ = c;
> > > -	  if (c == '\0')
> > > -	    break;
> > > -	  if (--n4 == 0)
> > > -	    goto last_chars;
> > > -	}
> > > -      n -= dest - s;
> > > -      goto zero_fill;
> > > -    }
> > > -
> > > - last_chars:
> > > -  n &= 3;
> > > -  if (n == 0)
> > > +  size_t size = __strnlen (src, n);
> > > +  memcpy (dest, src, size);
> > > +  dest += size;
> > > +  if (size == n)
> > >      return dest;
> > > -
> > > -  for (;;)
> > > -    {
> > > -      c = *src++;
> > > -      --n;
> > > -      *dest++ = c;
> > > -      if (c == '\0')
> > > -	break;
> > > -      if (n == 0)
> > > -	return dest;
> > > -    }
> > > -
> > > - zero_fill:
> > > -  while (n-- > 0)
> > > -    dest[n] = '\0';
> > > -
> > > -  return dest - 1;
> > > +  return memset (dest, '\0', n - size);
> > >  }
> > >  #ifdef weak_alias
> > >  libc_hidden_def (__stpncpy)
> > > --
> > > 1.9.1
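
For context, a minimal standalone sketch (editorial, not part of the patch) of
the stpncpy contract the rewrite has to preserve: copy at most n bytes from
src, zero-fill any remaining space in dest, and return a pointer to the first
'\0' written into dest, or dest + n if src does not terminate within n bytes.
The buffer sizes and strings are illustrative; it assumes a POSIX.1-2008 libc
that declares stpncpy in <string.h>.

#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <string.h>

int
main (void)
{
  char buf[8];

  /* Short source: 3 bytes copied, 5 bytes of zero padding, return value
     points at the terminating '\0' in dest.  */
  char *end = stpncpy (buf, "abc", sizeof buf);
  printf ("copied %zu, padded %zu\n",
          (size_t) (end - buf), sizeof buf - (size_t) (end - buf));

  /* Long source: exactly n bytes copied, no terminator written,
     return value is dest + n.  */
  end = stpncpy (buf, "longer than eight", sizeof buf);
  printf ("end - buf = %zu\n", (size_t) (end - buf));
  return 0;
}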




* Re: [PATCH][PING] Improve stpncpy performance
  2015-07-06 11:29 [PATCH][PING] Improve stpncpy performance Wilco Dijkstra
@ 2015-07-09 13:21 ` Ondřej Bílka
  2015-07-09 15:02   ` Wilco Dijkstra
  0 siblings, 1 reply; 5+ messages in thread
From: Ondřej Bílka @ 2015-07-09 13:21 UTC (permalink / raw)
  To: Wilco Dijkstra; +Cc: libc-alpha

On Mon, Jul 06, 2015 at 12:29:24PM +0100, Wilco Dijkstra wrote:
> > Wilco wrote:
> > ping
> > 
> > > > -----Original Message-----
> > > > From: Wilco Dijkstra [mailto:wdijkstr@arm.com]
> > > > Sent: 12 January 2015 15:09
> > > > To: 'libc-alpha@sourceware.org'
> > > > Subject: [PATCH] Improve stpncpy performance
> > > >
> > > > Like strncpy, this patch improves stpncpy performance by using
> > > > strnlen/memcpy/memset rather than a byte loop. Performance on
> > > > bench-stpncpy is ~2x faster on average.
> > > >
> > > > +  size_t size = __strnlen (src, n);
> > > > +  memcpy (dest, src, size);
> > > > +  dest += size;
> > > > +  if (size == n)
> > > >      return dest;
> > > > -
> > > > -  for (;;)
> > > > -    {
> > > > -      c = *src++;
> > > > -      --n;
> > > > -      *dest++ = c;
> > > > -      if (c == '\0')
> > > > -	break;
> > > > -      if (n == 0)
> > > > -	return dest;
> > > > -    }
> > > > -
> > > > - zero_fill:
> > > > -  while (n-- > 0)
> > > > -    dest[n] = '\0';
> > > > -
> > > > -  return dest - 1;
> > > > +  return memset (dest, '\0', n - size);
> > > >  }
> > > >  #ifdef weak_alias
> > > >  libc_hidden_def (__stpncpy)
> > > > --
> > > > 1.9.1
> 
You don't have to special-case

if (size == n)
  return dest;

as that case is already handled by

return memset (dest, '\0', 0);

Dropping the check could improve performance a bit if size == n is rare.
It doesn't matter much either way, as the memset makes this function slow
and it shouldn't be used in performance-sensitive code anyway.

Otherwise ok for me.
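
A hypothetical sketch of the simplification suggested above: drop the
size == n early return and rely on memset writing nothing when its length
argument is n - size == 0. The function name is made up for illustration,
and the public strnlen stands in for the glibc-internal __strnlen.

#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <string.h>

char *
stpncpy_no_branch (char *dest, const char *src, size_t n)
{
  size_t size = strnlen (src, n);
  memcpy (dest, src, size);
  /* memset returns its first argument, so when size == n this returns
     dest + n without writing anything, matching the special-cased path.  */
  return memset (dest + size, '\0', n - size);
}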


* RE: [PATCH][PING] Improve stpncpy performance
  2015-07-09 13:21 ` Ondřej Bílka
@ 2015-07-09 15:02   ` Wilco Dijkstra
  2015-08-12 20:57     ` Ondřej Bílka
  0 siblings, 1 reply; 5+ messages in thread
From: Wilco Dijkstra @ 2015-07-09 15:02 UTC (permalink / raw)
  To: 'Ondřej Bílka'; +Cc: libc-alpha

> Ondřej Bílka wrote:
>
> You don't have to special-case
> 
> if (size == n)
>   return dest;
> 
> as that case is already handled by
> 
> return memset (dest, '\0', 0);
> 
> Dropping the check could improve performance a bit if size == n is rare.
> It doesn't matter much either way, as the memset makes this function slow
> and it shouldn't be used in performance-sensitive code anyway.
> 
> Otherwise ok for me.

On the benchtests the extra if made a significant difference, particularly
because a zero-length memset is relatively expensive: memset treats it as a
very rare case. The size == n case may well be less likely in practice than
the benchtests indicate, but we'd have to fix the benchtest to use more
realistic data before we can tell.

Wilco
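
For comparison, a sketch of the form with the early return, essentially the
patch above adapted to the public strnlen (glibc itself uses __strnlen); the
function name is illustrative, and the branch is what avoids the zero-length
memset call discussed here.

#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <string.h>

char *
stpncpy_with_branch (char *dest, const char *src, size_t n)
{
  size_t size = strnlen (src, n);
  memcpy (dest, src, size);
  dest += size;
  if (size == n)
    /* src filled the whole buffer: no padding and no memset call needed.  */
    return dest;
  return memset (dest, '\0', n - size);
}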



* Re: [PATCH][PING] Improve stpncpy performance
  2015-07-09 15:02   ` Wilco Dijkstra
@ 2015-08-12 20:57     ` Ondřej Bílka
  2015-08-19 16:15       ` Wilco Dijkstra
  0 siblings, 1 reply; 5+ messages in thread
From: Ondřej Bílka @ 2015-08-12 20:57 UTC (permalink / raw)
  To: Wilco Dijkstra; +Cc: libc-alpha

On Thu, Jul 09, 2015 at 04:02:26PM +0100, Wilco Dijkstra wrote:
> > Ondřej Bílka wrote:
> >
> > You don't have to special-case
> > 
> > if (size == n)
> >   return dest;
> > 
> > as that case is already handled by
> > 
> > return memset (dest, '\0', 0);
> > 
> > Dropping the check could improve performance a bit if size == n is rare.
> > It doesn't matter much either way, as the memset makes this function slow
> > and it shouldn't be used in performance-sensitive code anyway.
> > 
> > Otherwise ok for me.
> 
> On the benchtests the extra if made a significant difference, particularly
> because a zero-length memset is relatively expensive: memset treats it as a
> very rare case. The size == n case may well be less likely in practice than
> the benchtests indicate, but we'd have to fix the benchtest to use more
> realistic data before we can tell.
>
OK, I did some data collection and I take my objection back, as the
size == n case almost always happens in bash. I was surprised that it
needs to use strncpy to copy such small numbers of bytes.

When I tested with the dryrun benchmark, the special casing is faster.
I got the following data for strncpy but not for stpncpy, so we could
reuse it for the stpcpy patch that you also submitted.

replaying bash

calls 194

average n:  15.6082    n <= 0:   4.6% n <= 4:   6.7% n <= 8:  43.3% n <= 16:  80.9% n <= 24:  88.7% n <= 32:  88.7% n <= 48:  91.2% n <= 64:  98.5% 
s aligned to 4 bytes:  99.5%  8 bytes:  97.9% 16 bytes:   0.5% 
average *s access cache latency  0.9072    l <= 8: 100.0% l <= 16: 100.0% l <= 32: 100.0% l <= 64: 100.0% l <= 128: 100.0% 
s2 aligned to 4 bytes:  34.0%  8 bytes:  24.2% 16 bytes:   1.5% 
s-s2 aligned to 4 bytes:  34.5%  8 bytes:  22.7% 16 bytes:  22.7% 
average *s2 access cache latency  1.1186    l <= 8: 100.0% l <= 16: 100.0% l <= 32: 100.0% l <= 64: 100.0% l <= 128: 100.0% 
average capacity:  15.6082    c <= 0:   4.6% c <= 4:   6.7% c <= 8:  43.3% c <= 16:  80.9% c <= 24:  88.7% c <= 32:  88.7% c <= 48:  91.2% c <= 64:  98.5% n == capa : 100.0% 

replaying mc

calls 6971

average n:  9.0773    n <= 0:   2.7% n <= 4:  54.3% n <= 8:  71.9% n <= 16:  85.5% n <= 24:  91.0% n <= 32:  94.2% n <= 48:  96.9% n <= 64:  98.7% 
s aligned to 4 bytes: 100.0%  8 bytes: 100.0% 16 bytes: 100.0% 
average *s access cache latency  36.3347    l <= 8: 100.0% l <= 16: 100.0% l <= 32: 100.0% l <= 64: 100.0% l <= 128: 100.0% 
s2 aligned to 4 bytes:  55.3%  8 bytes:  49.5% 16 bytes:  48.1% 
s-s2 aligned to 4 bytes:  55.3%  8 bytes:  49.5% 16 bytes:  48.1% 
average *s2 access cache latency  1.0126    l <= 8: 100.0% l <= 16: 100.0% l <= 32: 100.0% l <= 64: 100.0% l <= 128: 100.0% 
average capacity:  9.5847    c <= 0:   0.7% c <= 4:  52.2% c <= 8:  70.9% c <= 16:  84.9% c <= 24:  90.7% c <= 32:  93.8% c <= 48:  96.5% c <= 64:  98.7% n == capa :  63.6% 

replaying mutt

calls 10415

average n:  7.6572    n <= 0:   0.1% n <= 4:  68.6% n <= 8:  82.5% n <= 16:  86.4% n <= 24:  87.7% n <= 32:  94.5% n <= 48:  96.6% n <= 64:  98.9% 
s aligned to 4 bytes:  57.9%  8 bytes:  49.1% 16 bytes:  45.2% 
average *s access cache latency  1.1092    l <= 8: 100.0% l <= 16: 100.0% l <= 32: 100.0% l <= 64: 100.0% l <= 128: 100.0% 
s2 aligned to 4 bytes:  85.7%  8 bytes:  79.8% 16 bytes:  79.5% 
s-s2 aligned to 4 bytes:  43.6%  8 bytes:  28.9% 16 bytes:  24.7% 
average *s2 access cache latency  1.1324    l <= 8: 100.0% l <= 16: 100.0% l <= 32: 100.0% l <= 64: 100.0% l <= 128: 100.0% 
average capacity:  51.2750    c <= 0:   0.0% c <= 4:  65.3% c <= 8:  73.0% c <= 16:  74.6% c <= 24:  75.2% c <= 32:  75.3% c <= 48:  75.3% c <= 64:  75.3% n == capa :  64.4% 

replaying /bin/bash

calls 60

average n:  10.3167    n <= 0:   1.7% n <= 4:  26.7% n <= 8:  56.7% n <= 16:  88.3% n <= 24:  98.3% n <= 32:  98.3% n <= 48:  98.3% n <= 64:  98.3% 
s aligned to 4 bytes: 100.0%  8 bytes: 100.0% 16 bytes:   1.7% 
average *s access cache latency  0.8833    l <= 8: 100.0% l <= 16: 100.0% l <= 32: 100.0% l <= 64: 100.0% l <= 128: 100.0% 
s2 aligned to 4 bytes:  36.7%  8 bytes:  33.3% 16 bytes:   3.3% 
s-s2 aligned to 4 bytes:  36.7%  8 bytes:  33.3% 16 bytes:  31.7% 
average *s2 access cache latency  0.9167    l <= 8: 100.0% l <= 16: 100.0% l <= 32: 100.0% l <= 64: 100.0% l <= 128: 100.0% 
average capacity:  10.3167    c <= 0:   1.7% c <= 4:  26.7% c <= 8:  56.7% c <= 16:  88.3% c <= 24:  98.3% c <= 32:  98.3% c <= 48:  98.3% c <= 64:  98.3% n == capa : 100.0% 

replaying as

calls 122

average n:  6.8115    n <= 0:   0.8% n <= 4:   6.6% n <= 8:  95.1% n <= 16:  98.4% n <= 24: 100.0% n <= 32: 100.0% n <= 48: 100.0% n <= 64: 100.0% 
s aligned to 4 bytes: 100.0%  8 bytes: 100.0% 16 bytes: 100.0% 
average *s access cache latency  1.0410    l <= 8: 100.0% l <= 16: 100.0% l <= 32: 100.0% l <= 64: 100.0% l <= 128: 100.0% 
s2 aligned to 4 bytes:  25.4%  8 bytes:  13.1% 16 bytes:   5.7% 
s-s2 aligned to 4 bytes:  25.4%  8 bytes:  13.1% 16 bytes:   5.7% 
average *s2 access cache latency  0.9262    l <= 8: 100.0% l <= 16: 100.0% l <= 32: 100.0% l <= 64: 100.0% l <= 128: 100.0% 
average capacity:  126.9508    c <= 0:   0.8% c <= 4:   0.8% c <= 8:   0.8% c <= 16:   0.8% c <= 24:   0.8% c <= 32:   0.8% c <= 48:   0.8% c <= 64:   0.8% n == capa :   0.8% 


* RE: [PATCH][PING] Improve stpncpy performance
  2015-08-12 20:57     ` Ondřej Bílka
@ 2015-08-19 16:15       ` Wilco Dijkstra
  0 siblings, 0 replies; 5+ messages in thread
From: Wilco Dijkstra @ 2015-08-19 16:15 UTC (permalink / raw)
  To: 'Ondřej Bílka'; +Cc: libc-alpha

> Ondřej Bílka wrote:
> On Thu, Jul 09, 2015 at 04:02:26PM +0100, Wilco Dijkstra wrote:
> > > Ondřej Bílka wrote:
> > >
> > > You don't have to special-case
> > >
> > > if (size == n)
> > >   return dest;
> > >
> > > as that case is already handled by
> > >
> > > return memset (dest, '\0', 0);
> > >
> > > Dropping the check could improve performance a bit if size == n is rare.
> > > It doesn't matter much either way, as the memset makes this function slow
> > > and it shouldn't be used in performance-sensitive code anyway.
> > >
> > > Otherwise ok for me.
> >
> > On the benchtests the extra if made a significant difference, particularly
> > because a zero-length memset is relatively expensive: memset treats it as a
> > very rare case. The size == n case may well be less likely in practice than
> > the benchtests indicate, but we'd have to fix the benchtest to use more
> > realistic data before we can tell.
> >
> OK, I did some data collection and I take my objection back, as the
> size == n case almost always happens in bash. I was surprised that it
> needs to use strncpy to copy such small numbers of bytes.
> 
> When I tested with the dryrun benchmark, the special casing is faster.
> I got the following data for strncpy but not for stpncpy, so we could
> reuse it for the stpcpy patch that you also submitted.

OK, thanks for checking the statistics. I've now committed it.

Wilco



