Re: GCC performance regression

public inbox for gcc@gcc.gnu.org
 help / color / mirror / Atom feed

* Re: GCC performance regression - its memset!
@ 2002-04-22 23:07 Roger Sayle
  2002-04-22 23:30 ` Michel LESPINASSE
  0 siblings, 1 reply; 33+ messages in thread
From: Roger Sayle @ 2002-04-22 23:07 UTC (permalink / raw)
  To: gcc, Michel Lespinasse, Richard Henderson, Jan Hubicka

I think its one of Jan's changes.  I can reproduce the problem, and
fix it using "-minline-all-stringops" which forces 3.1 to inline the
memset on i686.  I was concerned that it was a middle-end bug with
builtins, but it now appears to be an ia32 back-end issue.

Michel, does "-minline-all-stringops" fix the problem for you?

Roger
--
Roger Sayle,                         E-mail: roger@eyesopen.com
OpenEye Scientific Software,         WWW: http://www.eyesopen.com/
Suite 1107, 3600 Cerrillos Road,     Tel: (+1) 505-473-7385
Santa Fe, New Mexico, 87507.         Fax: (+1) 505-473-0833

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: GCC performance regression - its memset!
  2002-04-22 23:07 GCC performance regression - its memset! Roger Sayle
@ 2002-04-22 23:30 ` Michel LESPINASSE
  2002-04-23  0:45   ` Michel LESPINASSE
  2002-04-23  2:53   ` Jan Hubicka
  0 siblings, 2 replies; 33+ messages in thread
From: Michel LESPINASSE @ 2002-04-22 23:30 UTC (permalink / raw)
  To: Roger Sayle; +Cc: gcc, Richard Henderson, Jan Hubicka

On Mon, Apr 22, 2002 at 11:13:09PM -0600, Roger Sayle wrote:
> 
> I think its one of Jan's changes.  I can reproduce the problem, and
> fix it using "-minline-all-stringops" which forces 3.1 to inline the
> memset on i686.  I was concerned that it was a middle-end bug with
> builtins, but it now appears to be an ia32 back-end issue.
> 
> Michel, does "-minline-all-stringops" fix the problem for you?

This option actually generates invalid code for me. Here is a test case:

------------------- cut here -----------------
#include <string.h>

short table[64];

int main (void)
{
    int i;

    for (i = 0; i < 64; i++)
        table[i] = 1234;

    memset (table, 0, 63 * sizeof(short));

    return (table[63] != 0);
}
------------------- cut here -----------------

This code should return 0, however it returns 1 (compiled with -O3
-minline-all-stringops)

Here is an extract from the generated asm (the memset part of it):
        movl    $table, %edi
        testl   $1, %edi       <- test 1-byte alignment (hmmm, isnt table
                                  already two-byte aligned, being a short ?)
        movl    $126, %eax     <- we want to clear 126 bytes
        je      .L7
        movb    $0, table
        movl    $table+1, %edi <- now edi is guaranteed two-byte-aligned
        movl    $125, %eax
.L7:
        testl   $2, %edi       <- test 4-byte alignment
        je      .L8
        movw    $0, (%edi)
        subl    $2, %eax       <- now edi is guaranteed four-byte-aligned
        addl    $2, %edi
.L8:
        cld
        movl    %eax, %ecx
        xorl    %eax, %eax
        shrl    $2, %ecx       <- number of 4-byte words remaining
        rep
        stosl
        testl   $2, %edi       <- ooops, its really meant to test the remainder
                                  not the address !!! so test will always fail.
        je      .L9
        movw    $0, (%edi)
        addl    $2, %edi
.L9:
        testl   $1, %edi       <- that one too.
        je      .L10
        movb    $0, (%edi)
.L10:


2.95 was generating simpler code:
        movl $table,%edi
        xorl %eax,%eax
        cld
        movl $31,%ecx
        rep
        stosl
        stosw

This did not take care about alignment issues, but was simpler and
actually faster on my athlon.

Hope this helps,

-- 
Michel "Walken" LESPINASSE
Is this the best that god can do ? Then I'm not impressed.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: GCC performance regression - its memset!
  2002-04-22 23:30 ` Michel LESPINASSE
@ 2002-04-23  0:45   ` Michel LESPINASSE
  2002-04-23  2:53   ` Jan Hubicka
  1 sibling, 0 replies; 33+ messages in thread
From: Michel LESPINASSE @ 2002-04-23  0:45 UTC (permalink / raw)
  To: Roger Sayle; +Cc: gcc, Richard Henderson, Jan Hubicka

My test case was actually wrong - the final test should be with
table[62] not table[63]. But, with this correction, it does exhibit
the bug I was mentionning.

Sorry for the mistake :/

On Mon, Apr 22, 2002 at 11:07:09PM -0700, Michel LESPINASSE wrote:
> This option actually generates invalid code for me. Here is a test case:
> ------------------- cut here -----------------
> #include <string.h>
> 
> short table[64];
> 
> int main (void)
> {
>     int i;
> 
>     for (i = 0; i < 64; i++)
>         table[i] = 1234;
> 
>     memset (table, 0, 63 * sizeof(short));
> 
>     return (table[63] != 0);
> }
> ------------------- cut here -----------------
> This code should return 0, however it returns 1 (compiled with -O3
> -minline-all-stringops)

-- 
Michel "Walken" LESPINASSE
Is this the best that god can do ? Then I'm not impressed.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: GCC performance regression - its memset!
  2002-04-22 23:30 ` Michel LESPINASSE
  2002-04-23  0:45   ` Michel LESPINASSE
@ 2002-04-23  2:53   ` Jan Hubicka
  2002-04-23  3:28     ` Richard Henderson
  2002-05-20  8:06     ` [3.1.1] " Jan Hubicka
  1 sibling, 2 replies; 33+ messages in thread
From: Jan Hubicka @ 2002-04-23  2:53 UTC (permalink / raw)
  To: Michel LESPINASSE
  Cc: Roger Sayle, gcc, Richard Henderson, Jan Hubicka, gcc-patches

>         rep
>         stosl
>         testl   $2, %edi       <- ooops, its really meant to test the remainder
>                                   not the address !!! so test will always fail.
>         je      .L9
>         movw    $0, (%edi)
>         addl    $2, %edi
> .L9:
>         testl   $1, %edi       <- that one too.
>         je      .L10
>         movb    $0, (%edi)
> .L10:

Hmm, an pasto.
In memcpy case I got it right, while in memset I broke it.  I am attaching patch
I am testing currently.  OK for mainline/branch assuming it passes?

COncerning the inlining, gcc inlines all memcpys with size smaller than 64 bytes.
Perhaps this should be extended to 128 bytes in case we are still about 2 times as bad.
This is partly due to lame implementation of memset in glibc too :(


Tue Apr 23 11:48:53 CEST 2002  Jan HUbicka  <jh@suse.cz>
	* i386.c (ix86_expand_clrstr): Fix pasto.
Index: i386.c
===================================================================
RCS file: /cvs/gcc/egcs/gcc/config/i386/i386.c,v
retrieving revision 1.353.2.14
diff -c -3 -p -r1.353.2.14 i386.c
*** i386.c	16 Apr 2002 18:16:36 -0000	1.353.2.14
--- i386.c	23 Apr 2002 09:47:50 -0000
*************** ix86_expand_clrstr (src, count_exp, alig
*** 9451,9457 ****
  				 gen_rtx_SUBREG (SImode, zeroreg, 0)));
        if (TARGET_64BIT && (align <= 4 || count == 0))
  	{
! 	  rtx label = ix86_expand_aligntest (destreg, 2);
  	  emit_insn (gen_strsetsi (destreg,
  				   gen_rtx_SUBREG (SImode, zeroreg, 0)));
  	  emit_label (label);
--- 9451,9457 ----
  				 gen_rtx_SUBREG (SImode, zeroreg, 0)));
        if (TARGET_64BIT && (align <= 4 || count == 0))
  	{
! 	  rtx label = ix86_expand_aligntest (countreg, 2);
  	  emit_insn (gen_strsetsi (destreg,
  				   gen_rtx_SUBREG (SImode, zeroreg, 0)));
  	  emit_label (label);
*************** ix86_expand_clrstr (src, count_exp, alig
*** 9462,9468 ****
  				 gen_rtx_SUBREG (HImode, zeroreg, 0)));
        if (align <= 2 || count == 0)
  	{
! 	  rtx label = ix86_expand_aligntest (destreg, 2);
  	  emit_insn (gen_strsethi (destreg,
  				   gen_rtx_SUBREG (HImode, zeroreg, 0)));
  	  emit_label (label);
--- 9462,9468 ----
  				 gen_rtx_SUBREG (HImode, zeroreg, 0)));
        if (align <= 2 || count == 0)
  	{
! 	  rtx label = ix86_expand_aligntest (countreg, 2);
  	  emit_insn (gen_strsethi (destreg,
  				   gen_rtx_SUBREG (HImode, zeroreg, 0)));
  	  emit_label (label);
*************** ix86_expand_clrstr (src, count_exp, alig
*** 9473,9479 ****
  				 gen_rtx_SUBREG (QImode, zeroreg, 0)));
        if (align <= 1 || count == 0)
  	{
! 	  rtx label = ix86_expand_aligntest (destreg, 1);
  	  emit_insn (gen_strsetqi (destreg,
  				   gen_rtx_SUBREG (QImode, zeroreg, 0)));
  	  emit_label (label);
--- 9473,9479 ----
  				 gen_rtx_SUBREG (QImode, zeroreg, 0)));
        if (align <= 1 || count == 0)
  	{
! 	  rtx label = ix86_expand_aligntest (countreg, 1);
  	  emit_insn (gen_strsetqi (destreg,
  				   gen_rtx_SUBREG (QImode, zeroreg, 0)));
  	  emit_label (label);

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: GCC performance regression - its memset!
  2002-04-23  2:53   ` Jan Hubicka
@ 2002-04-23  3:28     ` Richard Henderson
  2002-05-20  8:06     ` [3.1.1] " Jan Hubicka
  1 sibling, 0 replies; 33+ messages in thread
From: Richard Henderson @ 2002-04-23  3:28 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: Michel LESPINASSE, Roger Sayle, gcc, gcc-patches

On Tue, Apr 23, 2002 at 11:51:45AM +0200, Jan Hubicka wrote:
> 	* i386.c (ix86_expand_clrstr): Fix pasto.

Ok.


r~

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [3.1.1] Re: GCC performance regression - its memset!
  2002-04-23  2:53   ` Jan Hubicka
  2002-04-23  3:28     ` Richard Henderson
@ 2002-05-20  8:06     ` Jan Hubicka
  2002-05-20  9:36       ` Glen Nakamura
                         ` (3 more replies)
  1 sibling, 4 replies; 33+ messages in thread
From: Jan Hubicka @ 2002-05-20  8:06 UTC (permalink / raw)
  To: Jan Hubicka
  Cc: Michel LESPINASSE, Roger Sayle, gcc, Richard Henderson,
	gcc-patches, mark

> 
> Hmm, an pasto.
> In memcpy case I got it right, while in memset I broke it.  I am attaching patch
> I am testing currently.  OK for mainline/branch assuming it passes?
> 
> COncerning the inlining, gcc inlines all memcpys with size smaller than 64 bytes.
> Perhaps this should be extended to 128 bytes in case we are still about 2 times as bad.
> This is partly due to lame implementation of memset in glibc too :(

Mark,
Would this patch be OK for 3.1.1 branch?  It fixes serious misscompilation.
Not really regression, since extra switch is needed, but that switch seems
to be popular.
> 
> 
> Tue Apr 23 11:48:53 CEST 2002  Jan HUbicka  <jh@suse.cz>
> 	* i386.c (ix86_expand_clrstr): Fix pasto.
> Index: i386.c
> ===================================================================
> RCS file: /cvs/gcc/egcs/gcc/config/i386/i386.c,v
> retrieving revision 1.353.2.14
> diff -c -3 -p -r1.353.2.14 i386.c
> *** i386.c	16 Apr 2002 18:16:36 -0000	1.353.2.14
> --- i386.c	23 Apr 2002 09:47:50 -0000
> *************** ix86_expand_clrstr (src, count_exp, alig
> *** 9451,9457 ****
>   				 gen_rtx_SUBREG (SImode, zeroreg, 0)));
>         if (TARGET_64BIT && (align <= 4 || count == 0))
>   	{
> ! 	  rtx label = ix86_expand_aligntest (destreg, 2);
>   	  emit_insn (gen_strsetsi (destreg,
>   				   gen_rtx_SUBREG (SImode, zeroreg, 0)));
>   	  emit_label (label);
> --- 9451,9457 ----
>   				 gen_rtx_SUBREG (SImode, zeroreg, 0)));
>         if (TARGET_64BIT && (align <= 4 || count == 0))
>   	{
> ! 	  rtx label = ix86_expand_aligntest (countreg, 2);
>   	  emit_insn (gen_strsetsi (destreg,
>   				   gen_rtx_SUBREG (SImode, zeroreg, 0)));
>   	  emit_label (label);
> *************** ix86_expand_clrstr (src, count_exp, alig
> *** 9462,9468 ****
>   				 gen_rtx_SUBREG (HImode, zeroreg, 0)));
>         if (align <= 2 || count == 0)
>   	{
> ! 	  rtx label = ix86_expand_aligntest (destreg, 2);
>   	  emit_insn (gen_strsethi (destreg,
>   				   gen_rtx_SUBREG (HImode, zeroreg, 0)));
>   	  emit_label (label);
> --- 9462,9468 ----
>   				 gen_rtx_SUBREG (HImode, zeroreg, 0)));
>         if (align <= 2 || count == 0)
>   	{
> ! 	  rtx label = ix86_expand_aligntest (countreg, 2);
>   	  emit_insn (gen_strsethi (destreg,
>   				   gen_rtx_SUBREG (HImode, zeroreg, 0)));
>   	  emit_label (label);
> *************** ix86_expand_clrstr (src, count_exp, alig
> *** 9473,9479 ****
>   				 gen_rtx_SUBREG (QImode, zeroreg, 0)));
>         if (align <= 1 || count == 0)
>   	{
> ! 	  rtx label = ix86_expand_aligntest (destreg, 1);
>   	  emit_insn (gen_strsetqi (destreg,
>   				   gen_rtx_SUBREG (QImode, zeroreg, 0)));
>   	  emit_label (label);
> --- 9473,9479 ----
>   				 gen_rtx_SUBREG (QImode, zeroreg, 0)));
>         if (align <= 1 || count == 0)
>   	{
> ! 	  rtx label = ix86_expand_aligntest (countreg, 1);
>   	  emit_insn (gen_strsetqi (destreg,
>   				   gen_rtx_SUBREG (QImode, zeroreg, 0)));
>   	  emit_label (label);

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [3.1.1] Re: GCC performance regression - its memset!
  2002-05-20  8:06     ` [3.1.1] " Jan Hubicka
@ 2002-05-20  9:36       ` Glen Nakamura
  2002-05-20 10:08       ` Richard Henderson
                         ` (2 subsequent siblings)
  3 siblings, 0 replies; 33+ messages in thread
From: Glen Nakamura @ 2002-05-20  9:36 UTC (permalink / raw)
  To: jh; +Cc: walken, roger, gcc, rth, gcc-patches, mark

> Tue Apr 23 11:48:53 CEST 2002  Jan HUbicka  <jh@suse.cz>
> 	* i386.c (ix86_expand_clrstr): Fix pasto.

I believe this patch will fix optimization/6703:
http://gcc.gnu.org/cgi-bin/gnatsweb.pl?cmd=view%20audit-trail&database=gcc&pr=6703

It also fixes a miscompilation of mozilla bug #145267:
http://bugzilla.mozilla.org/show_bug.cgi?id=145267

BTW, I had a few more testsuite failures when I added this patch.
I'm not sure if it's related though...  anyone else get these errors?

- Glen Nakamura


Test Run By glen on Sun May 19 21:00:16 2002
Native configuration is i586-pc-linux-gnu

		=== gcc tests ===

Schedule of variations:
    unix

Running target unix
FAIL: gcc.c-torture/execute/20020402-1.c execution,  -Os
FAIL: gcc.c-torture/execute/loop-2e.c execution,  -Os
FAIL: gcc.c-torture/execute/loop-3c.c execution,  -Os
FAIL: gcc.dg/format/c90-scanf-1.c writing into NULL (test for warnings, line 118)
FAIL: gcc.dg/format/c90-scanf-1.c (test for excess errors)
FAIL: gcc.dg/format/c99-printf-1.c %hhn unsigned char (test for warnings, line 196)
FAIL: gcc.dg/format/c99-printf-1.c (test for excess errors)
FAIL: gcc.dg/format/ext-3.c bad %-_ (test for warnings, line 211)
FAIL: gcc.dg/format/ext-3.c bad %-0 (test for warnings, line 212)
FAIL: gcc.dg/format/ext-3.c bad %_0 (test for warnings, line 213)
FAIL: gcc.dg/format/ext-3.c bad %^# (test for warnings, line 215)
FAIL: gcc.dg/format/ext-3.c (test for excess errors)

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [3.1.1] Re: GCC performance regression - its memset!
  2002-05-20  8:06     ` [3.1.1] " Jan Hubicka
  2002-05-20  9:36       ` Glen Nakamura
@ 2002-05-20 10:08       ` Richard Henderson
  2002-05-20 10:38       ` Jakub Jelinek
  2002-05-20 12:07       ` Mark Mitchell
  3 siblings, 0 replies; 33+ messages in thread
From: Richard Henderson @ 2002-05-20 10:08 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: Michel LESPINASSE, Roger Sayle, gcc, gcc-patches, mark

On Mon, May 20, 2002 at 04:48:39PM +0200, Jan Hubicka wrote:
> > 	* i386.c (ix86_expand_clrstr): Fix pasto.

Ok for branch.


r~

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [3.1.1] Re: GCC performance regression - its memset!
  2002-05-20  8:06     ` [3.1.1] " Jan Hubicka
  2002-05-20  9:36       ` Glen Nakamura
  2002-05-20 10:08       ` Richard Henderson
@ 2002-05-20 10:38       ` Jakub Jelinek
  2002-05-20 11:28         ` Roger Sayle
  2002-05-21  8:04         ` Jan Hubicka
  2002-05-20 12:07       ` Mark Mitchell
  3 siblings, 2 replies; 33+ messages in thread
From: Jakub Jelinek @ 2002-05-20 10:38 UTC (permalink / raw)
  To: Jan Hubicka
  Cc: Michel LESPINASSE, Roger Sayle, gcc, Richard Henderson,
	gcc-patches, mark

On Mon, May 20, 2002 at 04:48:39PM +0200, Jan Hubicka wrote:
> > 
> > Hmm, an pasto.
> > In memcpy case I got it right, while in memset I broke it.  I am attaching patch
> > I am testing currently.  OK for mainline/branch assuming it passes?
> > 
> > COncerning the inlining, gcc inlines all memcpys with size smaller than 64 bytes.
> > Perhaps this should be extended to 128 bytes in case we are still about 2 times as bad.
> > This is partly due to lame implementation of memset in glibc too :(
> 
> Mark,
> Would this patch be OK for 3.1.1 branch?  It fixes serious misscompilation.
> Not really regression, since extra switch is needed, but that switch seems
> to be popular.

Do you consider -O or -O2 a special switch?
Anyway, could the testcase be commited with the patch too (the bug shows
up in C too BTW)?

--- gcc/testsuite/gcc.c-torture/execute/20020520-1.c.jj   Thu Aug 30 22:30:55 2001
+++ gcc/testsuite/gcc.c-torture/execute/20020520-1.c    Mon May 20 19:00:41 2002
@@ -0,0 +1,20 @@
+/* PR optimization/6703 */
+extern void abort (void);
+extern void exit (int);
+
+void foo (void **x, int y)
+{
+  __builtin_memset (x, 0, y);
+}
+
+int main ()
+{
+  unsigned char x[8] __attribute__ ((aligned (4)));
+  int i;
+
+  __builtin_memset (x, 0x5a, sizeof (x));
+  foo ((void **) (x + 3), 1);
+  for (i = 0; i < 8; i++)
+    if (x[i] != (i != 3 ? 0x5a : 0))
+      abort ();
+  exit (0);
+}

> > 
> > 
> > Tue Apr 23 11:48:53 CEST 2002  Jan HUbicka  <jh@suse.cz>
> > 	* i386.c (ix86_expand_clrstr): Fix pasto.

	Jakub

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [3.1.1] Re: GCC performance regression - its memset!
  2002-05-20 10:38       ` Jakub Jelinek
@ 2002-05-20 11:28         ` Roger Sayle
  2002-05-20 12:57           ` Glen Nakamura
  2002-05-21  8:04         ` Jan Hubicka
  1 sibling, 1 reply; 33+ messages in thread
From: Roger Sayle @ 2002-05-20 11:28 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: Jan Hubicka, gcc-patches, gcc

> Do you consider -O or -O2 a special switch?
> Anyway, could the testcase be commited with the patch too (the bug shows
> up in C too BTW)?
>
> --- gcc/testsuite/gcc.c-torture/execute/20020520-1.c.jj   Thu Aug 30 22:30:55 2001
> +++ gcc/testsuite/gcc.c-torture/execute/20020520-1.c    Mon May 20 19:00:41 2002
> @@ -0,0 +1,20 @@
> +/* PR optimization/6703 */
> +extern void abort (void);
> +extern void exit (int);
> +
> +void foo (void **x, int y)
> +{
> +  __builtin_memset (x, 0, y);
> +}
> +
> +int main ()
> +{
> +  unsigned char x[8] __attribute__ ((aligned (4)));
> +  int i;
> +
> +  __builtin_memset (x, 0x5a, sizeof (x));
> +  foo ((void **) (x + 3), 1);
> +  for (i = 0; i < 8; i++)
> +    if (x[i] != (i != 3 ? 0x5a : 0))
> +      abort ();
> +  exit (0);
> +}

Hi Jakub,

As I've mentioned in a previous e-mail, I believe that this test
case is poorly formed; quite clearly the x argument to foo is
incorrectly aligned for the type.  On many platforms, the above
code (and even with foo defined as "*x = 0;") will fail at runtime.

The benefit of Jan's patch is that on machines without alignment
restrictions, we'll still produce the "expected behaviour" for
incorrectly written programs.  Mozilla's source code is incorrect
and needs to be fixed (use void* rather than void**).  This isn't
really a correctness PR, more a quality of implementation for
undefined behaviour PR.

Roger
--

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [3.1.1] Re: GCC performance regression - its memset!
  2002-05-20 11:28         ` Roger Sayle
@ 2002-05-20 12:57           ` Glen Nakamura
  2002-05-20 16:58             ` Roger Sayle
  2002-05-21  8:23             ` Jack Lloyd
  0 siblings, 2 replies; 33+ messages in thread
From: Glen Nakamura @ 2002-05-20 12:57 UTC (permalink / raw)
  To: roger; +Cc: jakub, jh, gcc

> Hi Jakub,
> 
> As I've mentioned in a previous e-mail, I believe that this test
> case is poorly formed; quite clearly the x argument to foo is
> incorrectly aligned for the type.  On many platforms, the above
> code (and even with foo defined as "*x = 0;") will fail at runtime.
> 
> The benefit of Jan's patch is that on machines without alignment
> restrictions, we'll still produce the "expected behaviour" for
> incorrectly written programs.  Mozilla's source code is incorrect
> and needs to be fixed (use void* rather than void**).  This isn't
> really a correctness PR, more a quality of implementation for
> undefined behaviour PR.
> 
> Roger

Aloha,

FWIW, I think this really is a regression...  The problem is really
that memset doesn't properly set bytes at the tail of the buffer.
How about this testcase?

- Glen Nakamura


/* PR optimization/6703 */
extern void abort (void);
extern void exit (int);

void foo (int *x, int y)
{
  __builtin_memset (x, 0, y);
}

int main ()
{
  int x[2] = { 0x5a5a5a5a, 0x5a5a5a5a };

  if (x[1] != 0x5a5a5a5a)
      abort ();
  foo (x, 5);
  if (x[1] == 0x5a5a5a5a)
      abort ();
  exit (0);
}

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [3.1.1] Re: GCC performance regression - its memset!
  2002-05-20 12:57           ` Glen Nakamura
@ 2002-05-20 16:58             ` Roger Sayle
  2002-05-21  8:23             ` Jack Lloyd
  1 sibling, 0 replies; 33+ messages in thread
From: Roger Sayle @ 2002-05-20 16:58 UTC (permalink / raw)
  To: Glen Nakamura; +Cc: jakub, jh, gcc

Hi Glen,
> FWIW, I think this really is a regression...  The problem is really
> that memset doesn't properly set bytes at the tail of the buffer.

My apologies.  You're right it really is a serious regression!

> How about this testcase?

The other possibility is to copy the memset-1.c and memset-2.c
tests from CVS mainline to the branch's gcc.c-torture/execute.
These "correctness" tests would also have uncovered the regression.

Roger
--

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [3.1.1] Re: GCC performance regression - its memset!
  2002-05-20 12:57           ` Glen Nakamura
  2002-05-20 16:58             ` Roger Sayle
@ 2002-05-21  8:23             ` Jack Lloyd
  2002-05-21  9:55               ` Glen Nakamura
  1 sibling, 1 reply; 33+ messages in thread
From: Jack Lloyd @ 2002-05-21  8:23 UTC (permalink / raw)
  To: gcc

This looks to be an endian issue.

Some code:
------------- CUT ----------------
#include <stdio.h>

void foo(void *x, int y)
   {
   __builtin_memset(x, 0, y);
   }

int main()
   {
   int x[2] = { 0x5a5a5a5a, 0x5a5a5a5a };
   printf("%08X %08X\n", x[0], x[1]);
   foo(x, 5);
   printf("%08X %08X\n", x[0], x[1]);
   return 0;
   }
------------- CUT ----------------

Let's try this on x86:

5A5A5A5A 5A5A5A5A
00000000 5A5A5A00

And again on SPARC:

5A5A5A5A 5A5A5A5A
00000000 005A5A5A

where it comes out what you expect, at least if you expect what I expect
(or something...)

And, honestly, I'm not sure if this is how it's supposed to be or not.
What are the rules for doing a memset across part of an object? Certainly,
in both cases it sets 5 bytes to zero, and those bytes are contiguous in
memory. The thing is, I can't imagine how you could actually change this on
x86 to work like it does on a big-endian machine, because you would have to
know the size of the type involved.

HTH,
  -Jack

On Mon, 20 May 2002, Glen Nakamura wrote:

> > Hi Jakub,
> >
> > As I've mentioned in a previous e-mail, I believe that this test
> > case is poorly formed; quite clearly the x argument to foo is
> > incorrectly aligned for the type.  On many platforms, the above
> > code (and even with foo defined as "*x = 0;") will fail at runtime.
> >
> > The benefit of Jan's patch is that on machines without alignment
> > restrictions, we'll still produce the "expected behaviour" for
> > incorrectly written programs.  Mozilla's source code is incorrect
> > and needs to be fixed (use void* rather than void**).  This isn't
> > really a correctness PR, more a quality of implementation for
> > undefined behaviour PR.
> >
> > Roger
>
> Aloha,
>
> FWIW, I think this really is a regression...  The problem is really
> that memset doesn't properly set bytes at the tail of the buffer.
> How about this testcase?
>
> - Glen Nakamura
>
>
> /* PR optimization/6703 */
> extern void abort (void);
> extern void exit (int);
>
> void foo (int *x, int y)
> {
>   __builtin_memset (x, 0, y);
> }
>
> int main ()
> {
>   int x[2] = { 0x5a5a5a5a, 0x5a5a5a5a };
>
>   if (x[1] != 0x5a5a5a5a)
>       abort ();
>   foo (x, 5);
>   if (x[1] == 0x5a5a5a5a)
>       abort ();
>   exit (0);
> }
>



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [3.1.1] Re: GCC performance regression - its memset!
  2002-05-21  8:23             ` Jack Lloyd
@ 2002-05-21  9:55               ` Glen Nakamura
  2002-05-21 11:13                 ` Jack Lloyd
  0 siblings, 1 reply; 33+ messages in thread
From: Glen Nakamura @ 2002-05-21  9:55 UTC (permalink / raw)
  To: lloyd; +Cc: gcc

> This looks to be an endian issue.
> 
> Some code:
> ------------- CUT ----------------
> #include <stdio.h>
> 
> void foo(void *x, int y)
>    {
>    __builtin_memset(x, 0, y);
>    }
> 
> int main()
>    {
>    int x[2] = { 0x5a5a5a5a, 0x5a5a5a5a };
>    printf("%08X %08X\n", x[0], x[1]);
>    foo(x, 5);
>    printf("%08X %08X\n", x[0], x[1]);
>    return 0;
>    }
> ------------- CUT ----------------
> 
> Let's try this on x86:
> 
> 5A5A5A5A 5A5A5A5A
> 00000000 5A5A5A00
> 
> And again on SPARC:
> 
> 5A5A5A5A 5A5A5A5A
> 00000000 005A5A5A
> 
> where it comes out what you expect, at least if you expect what I expect
> (or something...)
> 
> And, honestly, I'm not sure if this is how it's supposed to be or not.
> What are the rules for doing a memset across part of an object? Certainly,
> in both cases it sets 5 bytes to zero, and those bytes are contiguous in
> memory. The thing is, I can't imagine how you could actually change this on
> x86 to work like it does on a big-endian machine, because you would have to
> know the size of the type involved.
> 
> HTH,
>   -Jack
> 

Actually, the failure occurs with gcc-3.1 on x86 when compiling with -O2.
I would expect the following output from your example:

5A5A5A5A 5A5A5A5A
00000000 5A5A5A5A

Only the first 4 bytes are set...

- Glen Nakamura

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [3.1.1] Re: GCC performance regression - its memset!
  2002-05-21  9:55               ` Glen Nakamura
@ 2002-05-21 11:13                 ` Jack Lloyd
  0 siblings, 0 replies; 33+ messages in thread
From: Jack Lloyd @ 2002-05-21 11:13 UTC (permalink / raw)
  To: Glen Nakamura; +Cc: gcc

On Tue, 21 May 2002, Glen Nakamura wrote:

Ah, OK. My mistake (no optimizations used on mine). -J

> Actually, the failure occurs with gcc-3.1 on x86 when compiling with -O2.
> I would expect the following output from your example:
>
> 5A5A5A5A 5A5A5A5A
> 00000000 5A5A5A5A
>
> Only the first 4 bytes are set...
>
> - Glen Nakamura
>

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [3.1.1] Re: GCC performance regression - its memset!
  2002-05-20 10:38       ` Jakub Jelinek
  2002-05-20 11:28         ` Roger Sayle
@ 2002-05-21  8:04         ` Jan Hubicka
  1 sibling, 0 replies; 33+ messages in thread
From: Jan Hubicka @ 2002-05-21  8:04 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: Jan Hubicka, Michel LESPINASSE, Roger Sayle, gcc,
	Richard Henderson, gcc-patches, mark

> > Mark,
> > Would this patch be OK for 3.1.1 branch?  It fixes serious misscompilation.
> > Not really regression, since extra switch is needed, but that switch seems
> > to be popular.
> 
> Do you consider -O or -O2 a special switch?
> Anyway, could the testcase be commited with the patch too (the bug shows
> up in C too BTW)?

I meant -minline-all-stringops and didn't noticed this reproduce for some
cases of default settings as well..

Honza

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [3.1.1] Re: GCC performance regression - its memset!
  2002-05-20  8:06     ` [3.1.1] " Jan Hubicka
                         ` (2 preceding siblings ...)
  2002-05-20 10:38       ` Jakub Jelinek
@ 2002-05-20 12:07       ` Mark Mitchell
  3 siblings, 0 replies; 33+ messages in thread
From: Mark Mitchell @ 2002-05-20 12:07 UTC (permalink / raw)
  To: Jan Hubicka
  Cc: Michel LESPINASSE, Roger Sayle, gcc, Richard Henderson,
	gcc-patches, mark



--On Monday, May 20, 2002 04:48:39 PM +0200 Jan Hubicka <jh@suse.cz> wrote:

>>
>> Hmm, an pasto.
>> In memcpy case I got it right, while in memset I broke it.  I am
>> attaching patch I am testing currently.  OK for mainline/branch assuming
>> it passes?
>>
>> COncerning the inlining, gcc inlines all memcpys with size smaller than
>> 64 bytes. Perhaps this should be extended to 128 bytes in case we are
>> still about 2 times as bad. This is partly due to lame implementation of
>> memset in glibc too :(
>
> Mark,
> Would this patch be OK for 3.1.1 branch?  It fixes serious
> misscompilation. Not really regression, since extra switch is needed, but
> that switch seems to be popular.

Which extra switch?  I couldn't see it in the code, but I can never
remember all the switches. :-)

If this patch has been applied on the mainline and no problems have been
reported, then it is OK for the branch.

Thanks,

--
Mark Mitchell                   mark@codesourcery.com
CodeSourcery, LLC               http://www.codesourcery.com

^ permalink raw reply	[flat|nested] 33+ messages in thread

* GCC performance regression - up to 20% ?
@ 2002-04-20 18:13 Michel LESPINASSE
  2002-04-22 14:33 ` GCC performance regression - its memset ! Michel LESPINASSE
  0 siblings, 1 reply; 33+ messages in thread
From: Michel LESPINASSE @ 2002-04-20 18:13 UTC (permalink / raw)
  To: gcc list

Hi,

I have downloaded the latest 3.1 snapshot (20020415) and ran some
performance tests. So far I've been impressed by the FP performance,
but kinda disappointed by the integer performance.

The benchmarks I've run are two libraries I maintain, libmpeg2 and
liba52. These are used by several open-source dvd players, and are
quite CPU intensive (especially libmpeg2). So here are my results,
using gcc 2.95 as a reference:

First the good news: liba52 (mostly FP intensive workload)
on athlon tbird 950, using -mcpu=pentiumpro:
gcc-3.0 is between 4.5% and 6.5% faster than 2.95.4 depending on streams
gcc-3.1 snapshot is between 8% and 9.5% faster than 2.95.4
from these measurements 3.1 has a very nice performance, very close to
intel's icc. Great work ! Also using -march=athlon-tbird and
generating sse code, I can get yet a few extra % of performance.

Now the bad news: for libmepg2, which is an integer-only workload, I
get a 10% to 20% performance regression between 2.95.4 and 3.1... 3.0
was already slower than 2.95.4, but 3.1 seems to be worse for this
workload at least.

libmpeg2, on athlon tbird 950, with mmx optimizations:
gcc-3.0 is about 2% slower than 2.95.4
gcc-3.1 snapshot is about 10% slower than 2.95.4

libmpeg2, on athlon tbird 950, using pure C code:
gcc-3.0 is about 4.5% slower than 2.95.4
gcc-3.1 snapshot is about 5.5% slower than 2.95.4

libmpeg2, on celeron 366, with mmx optimizations:
gcc-3.0 is about 4% slower than 2.95.4
gcc-3.1 snapshot is about 20.5% slower than 2.95.4 (!!!!)

These results are all very repeatable. the celeron 366 results are the
most worrying, as this processor already has borderline performance
for decoding mpeg2 streams.

Is there a known performance regression in current GCCs (say, do they
get lower SPECint scores ?) or is it only with my code ?

Also, is there anything I could do in my code to enhance performance
with newer gcc versions ? One thing I noticed is that 3.1 snapshot
produces less inlining than 3.0 or 2.95. This probably accounts for
some of the slowdown I see when using mmx optimizations, as my mmx
routines are written using a few routines that I really expect to get
inlined. Is there any way I can get back control about that, so that
gcc honours the inline keyword ? I have not managed to do this either.

BTW, these two apps I mentionned can be found at
http://libmpeg2.sourceforge.net/
http://liba52.sourceforge.net/

Puzzled,

-- 
Michel "Walken" LESPINASSE
Is this the best that god can do ? Then I'm not impressed.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* GCC performance regression - its memset !
  2002-04-20 18:13 GCC performance regression - up to 20% ? Michel LESPINASSE
@ 2002-04-22 14:33 ` Michel LESPINASSE
  2002-04-22 14:58   ` Jason R Thorpe
  2002-04-22 17:10   ` Richard Henderson
  0 siblings, 2 replies; 33+ messages in thread
From: Michel LESPINASSE @ 2002-04-22 14:33 UTC (permalink / raw)
  To: gcc list

OK, so I worked more to find the cause of the slowdown, and I figured
out its all because of memset(). This function seems to be about twice
slower than in 2.95, and also for some reason the time spent in memset
does not show up in gprof.

Here is a test case:

--------------------------- foo.c ------------------------------
#include <string.h>

short table[64];

void bar (void);

int main (void)
{
    int i;

    bar ();

    for (i = 0; i < 100000000; i++)
        memset (table + 1, 0, 63 * sizeof(short));
    return 0;
}
----------------------------- end of foo.c ------------------------

----------------------------- bar.c -------------------------------
void bar (void)
{
}
----------------------------- end of bar.c ------------------------

# gcc-2.95 -g -O3 -p foo.c bar.c
# time ./a.out
./a.out  5.75s user 0.00s system 100% cpu 5.739 total
# gprof -bp ./a.out
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  Ts/call  Ts/call  name    
100.00      5.74     5.74                             main
  0.00      5.74     0.00        1     0.00     0.00  bar


# gcc-3.1 -g -O3 -p foo.c bar.c
# time ./a.out
./a.out  10.78s user 0.00s system 101% cpu 10.634 total
# gprof -bp ./a.out
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  Ts/call  Ts/call  name    
100.00      0.62     0.62                             main
  0.00      0.62     0.00        1     0.00     0.00  bar


gcc-3.1 snapshot is about twice slower than 2.95 on that test case,
and for some reason the gprof output is bogus (it does not account for
the time spent in memset), while it was not with 2.95.

I did not know my code spent that much time in memset, I'll see what I
can do about it.

Hope this helps,

-- 
Michel "Walken" LESPINASSE
Is this the best that god can do ? Then I'm not impressed.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: GCC performance regression - its memset !
  2002-04-22 14:33 ` GCC performance regression - its memset ! Michel LESPINASSE
@ 2002-04-22 14:58   ` Jason R Thorpe
  2002-04-22 15:27     ` Michel LESPINASSE
  2002-04-22 16:59     ` Segher Boessenkool
  2002-04-22 17:10   ` Richard Henderson
  1 sibling, 2 replies; 33+ messages in thread
From: Jason R Thorpe @ 2002-04-22 14:58 UTC (permalink / raw)
  To: Michel LESPINASSE; +Cc: gcc list

On Mon, Apr 22, 2002 at 02:32:22PM -0700, Michel LESPINASSE wrote:

 > gcc-3.1 snapshot is about twice slower than 2.95 on that test case,
 > and for some reason the gprof output is bogus (it does not account for
 > the time spent in memset), while it was not with 2.95.

gprof doesn't see it because gcc is doing the memset inline, presumably; it
does this in certain cases where it knows the size at compile time.

Try running your test with -fno-builtin.  (I'm not suggesting this as a
fix for your performance issue, just as an explanation of why memset() is
invisible to gprof in your testcase).

-- 
        -- Jason R. Thorpe <thorpej@wasabisystems.com>

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: GCC performance regression - its memset !
  2002-04-22 14:58   ` Jason R Thorpe
@ 2002-04-22 15:27     ` Michel LESPINASSE
  2002-04-22 16:59     ` Segher Boessenkool
  1 sibling, 0 replies; 33+ messages in thread
From: Michel LESPINASSE @ 2002-04-22 15:27 UTC (permalink / raw)
  To: Jason R Thorpe, gcc list

On Mon, Apr 22, 2002 at 02:41:25PM -0700, Jason R Thorpe wrote:
> gprof doesn't see it because gcc is doing the memset inline, presumably; it
> does this in certain cases where it knows the size at compile time.

I'm not surprised that there is no line for memset in the flat
profile. But, I'm surprised that the cumulative time in the flat
profile is not equal to the user time in the time command. 'normal'
inlines (i.e. the ones I could write by declaring an inline function)
are accounted normally by gprof (i.e. they do show up in the
cumulative time).

-- 
Michel "Walken" LESPINASSE
Is this the best that god can do ? Then I'm not impressed.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: GCC performance regression - its memset !
  2002-04-22 14:58   ` Jason R Thorpe
  2002-04-22 15:27     ` Michel LESPINASSE
@ 2002-04-22 16:59     ` Segher Boessenkool
  1 sibling, 0 replies; 33+ messages in thread
From: Segher Boessenkool @ 2002-04-22 16:59 UTC (permalink / raw)
  To: thorpej; +Cc: Michel LESPINASSE, gcc list

Jason R Thorpe wrote:
> 
> On Mon, Apr 22, 2002 at 02:32:22PM -0700, Michel LESPINASSE wrote:
> 
>  > gcc-3.1 snapshot is about twice slower than 2.95 on that test case,
>  > and for some reason the gprof output is bogus (it does not account for
>  > the time spent in memset), while it was not with 2.95.
> 
> gprof doesn't see it because gcc is doing the memset inline, presumably; it
> does this in certain cases where it knows the size at compile time.
> 
> Try running your test with -fno-builtin.  (I'm not suggesting this as a
> fix for your performance issue, just as an explanation of why memset() is
> invisible to gprof in your testcase).

Erm, no.  In the 2.95 case, GCC _did_ inline the memset();  that's why it
showed up as 5.74 seconds in main().  In the 3.1 case, it was not inlined;
gprof doesn't show you the time spent in libc, as libc is not compiled
with profiling enabled.  Inlined functions are never profiled separately.

Linking statically will make gprof show you the time spent in shared library
functions (but not the call graph to those); or you can link against a libc
that was compiled with profiling enabled, so you get a call graph as well.

Segher

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: GCC performance regression - its memset !
  2002-04-22 14:33 ` GCC performance regression - its memset ! Michel LESPINASSE
  2002-04-22 14:58   ` Jason R Thorpe
@ 2002-04-22 17:10   ` Richard Henderson
  2002-04-22 17:13     ` Michel LESPINASSE
  1 sibling, 1 reply; 33+ messages in thread
From: Richard Henderson @ 2002-04-22 17:10 UTC (permalink / raw)
  To: Michel LESPINASSE; +Cc: gcc list

On Mon, Apr 22, 2002 at 02:32:22PM -0700, Michel LESPINASSE wrote:
> OK, so I worked more to find the cause of the slowdown, and I figured
> out its all because of memset().

See whether or not disabling glibc's inline expansion of 
memset affects 2.95 vs 3.x with -D__NO_STRING_INLINES.


r~

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: GCC performance regression - its memset !
  2002-04-22 17:10   ` Richard Henderson
@ 2002-04-22 17:13     ` Michel LESPINASSE
  2002-04-22 17:39       ` Richard Henderson
  2002-04-23  2:39       ` Jan Hubicka
  0 siblings, 2 replies; 33+ messages in thread
From: Michel LESPINASSE @ 2002-04-22 17:13 UTC (permalink / raw)
  To: Richard Henderson, gcc list

On Mon, Apr 22, 2002 at 04:59:53PM -0700, Richard Henderson wrote:
> See whether or not disabling glibc's inline expansion of 
> memset affects 2.95 vs 3.x with -D__NO_STRING_INLINES.

I can get gcc 2.95 to be as slow as 3.1 snapshot by using both
-fno-builtin and -D__NO_STRING_INLINES

but, I cant get gcc 3.1 to do the inlining for me

If I dont use -fno-builtin -D__NO_STRING_INLINES, and just compile with -O3,
gcc 2.95 generates this:

        movl $table+2,%edi
        cld
        movl $31,%ecx
        rep
        stosl
        stosw

while 3.1 snapshot generates this:

        movl    $table+2, (%esp)
        movl    $0, 4(%esp)
        movl    $126, 8(%esp)
        call    memset

Hope this helps,

-- 
Michel "Walken" LESPINASSE
Is this the best that god can do ? Then I'm not impressed.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: GCC performance regression - its memset !
  2002-04-22 17:13     ` Michel LESPINASSE
@ 2002-04-22 17:39       ` Richard Henderson
  2002-04-22 17:49         ` Michel LESPINASSE
  2002-04-23  2:39       ` Jan Hubicka
  1 sibling, 1 reply; 33+ messages in thread
From: Richard Henderson @ 2002-04-22 17:39 UTC (permalink / raw)
  To: Michel LESPINASSE; +Cc: gcc list

On Mon, Apr 22, 2002 at 05:10:45PM -0700, Michel LESPINASSE wrote:
> I can get gcc 2.95 to be as slow as 3.1 snapshot by using both
> -fno-builtin and -D__NO_STRING_INLINES

I wasn't interested in -fno-builtin, but only differences seen
when both 2.95 and 3.1 are using -D__NO_STRING_INLINES.


r~

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: GCC performance regression - its memset !
  2002-04-22 17:39       ` Richard Henderson
@ 2002-04-22 17:49         ` Michel LESPINASSE
  2002-04-23  5:03           ` Falk Hueffner
  0 siblings, 1 reply; 33+ messages in thread
From: Michel LESPINASSE @ 2002-04-22 17:49 UTC (permalink / raw)
  To: Richard Henderson, gcc list

On Mon, Apr 22, 2002 at 05:13:35PM -0700, Richard Henderson wrote:
> I wasn't interested in -fno-builtin, but only differences seen
> when both 2.95 and 3.1 are using -D__NO_STRING_INLINES.

If I use just -O3 -D__NO_STRING_INLINES, I get the same behaviour as
with just -O3 (i.e. 2.95 inlines the memset and 3.1 doesnt)

-- 
Michel "Walken" LESPINASSE
Is this the best that god can do ? Then I'm not impressed.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: GCC performance regression - its memset !
  2002-04-22 17:49         ` Michel LESPINASSE
@ 2002-04-23  5:03           ` Falk Hueffner
  2002-04-23  6:53             ` Andreas Schwab
  0 siblings, 1 reply; 33+ messages in thread
From: Falk Hueffner @ 2002-04-23  5:03 UTC (permalink / raw)
  To: gcc list

Hi,

while we're at it, when you have a prototype for memset, gcc forgets
about the alignment:

void f1(unsigned long *p) { memset(p, 0, 16); }
#include <string.h>
void f2(unsigned long *p) { memset(p, 0, 16); }

-> (Alpha)

0000000000000000 <f1>:
   0:   08 00 f0 b7     stq     zero,8(a0)
   4:   1f 04 ff 5f     fnop
   8:   00 00 f0 b7     stq     zero,0(a0)
   c:   01 80 fa 6b     ret
0000000000000020 <f2>:
  20:   0f 00 30 2c     ldq_u   t0,15(a0)
  24:   00 00 50 2c     ldq_u   t1,0(a0)
  28:   41 0e 30 48     mskqh   t0,a0,t0
  2c:   0f 00 30 3c     stq_u   t0,15(a0)
  30:   42 06 50 48     mskql   t1,a0,t1
  34:   08 00 f0 3f     stq_u   zero,8(a0)
  38:   00 00 50 3c     stq_u   t1,0(a0)
  3c:   01 80 fa 6b     ret

Is there anything that can be done about that?

        Falk

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: GCC performance regression - its memset !
  2002-04-23  5:03           ` Falk Hueffner
@ 2002-04-23  6:53             ` Andreas Schwab
  0 siblings, 0 replies; 33+ messages in thread
From: Andreas Schwab @ 2002-04-23  6:53 UTC (permalink / raw)
  To: Falk Hueffner; +Cc: gcc list

Falk Hueffner <falk.hueffner@student.uni-tuebingen.de> writes:

|> Hi,
|> 
|> while we're at it, when you have a prototype for memset, gcc forgets
|> about the alignment:
|> 
|> void f1(unsigned long *p) { memset(p, 0, 16); }
|> #include <string.h>
|> void f2(unsigned long *p) { memset(p, 0, 16); }
|> 
|> -> (Alpha)
|> 
|> 0000000000000000 <f1>:
|>    0:   08 00 f0 b7     stq     zero,8(a0)
|>    4:   1f 04 ff 5f     fnop
|>    8:   00 00 f0 b7     stq     zero,0(a0)
|>    c:   01 80 fa 6b     ret
|> 0000000000000020 <f2>:
|>   20:   0f 00 30 2c     ldq_u   t0,15(a0)
|>   24:   00 00 50 2c     ldq_u   t1,0(a0)
|>   28:   41 0e 30 48     mskqh   t0,a0,t0
|>   2c:   0f 00 30 3c     stq_u   t0,15(a0)
|>   30:   42 06 50 48     mskql   t1,a0,t1
|>   34:   08 00 f0 3f     stq_u   zero,8(a0)
|>   38:   00 00 50 3c     stq_u   t1,0(a0)
|>   3c:   01 80 fa 6b     ret
|> 
|> Is there anything that can be done about that?

Please try -D__NO_STRING_INLINES.

Andreas.

-- 
Andreas Schwab, SuSE Labs, schwab@suse.de
SuSE GmbH, Deutschherrnstr. 15-19, D-90429 NÃ¼rnberg
Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: GCC performance regression - its memset !
  2002-04-22 17:13     ` Michel LESPINASSE
  2002-04-22 17:39       ` Richard Henderson
@ 2002-04-23  2:39       ` Jan Hubicka
  2002-04-23 13:36         ` Michel LESPINASSE
  1 sibling, 1 reply; 33+ messages in thread
From: Jan Hubicka @ 2002-04-23  2:39 UTC (permalink / raw)
  To: Michel LESPINASSE; +Cc: Richard Henderson, gcc list

> On Mon, Apr 22, 2002 at 04:59:53PM -0700, Richard Henderson wrote:
> > See whether or not disabling glibc's inline expansion of 
> > memset affects 2.95 vs 3.x with -D__NO_STRING_INLINES.
> 
> I can get gcc 2.95 to be as slow as 3.1 snapshot by using both
> -fno-builtin and -D__NO_STRING_INLINES
> 
> but, I cant get gcc 3.1 to do the inlining for me
> 
> If I dont use -fno-builtin -D__NO_STRING_INLINES, and just compile with -O3,
> gcc 2.95 generates this:
> 
>         movl $table+2,%edi
>         cld
>         movl $31,%ecx
>         rep
>         stosl
>         stosw
> 
> while 3.1 snapshot generates this:
> 
>         movl    $table+2, (%esp)
>         movl    $0, 4(%esp)
>         movl    $126, 8(%esp)
>         call    memset
I guess the inlining threshold is too low or the default memset implementation
too lame.
I was tunning it for Athlon, so the mileage may warry from CPU to CPU.
I will investigate the misscompilation first and check this second.

Honza
> 
> Hope this helps,
> 
> -- 
> Michel "Walken" LESPINASSE
> Is this the best that god can do ? Then I'm not impressed.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: GCC performance regression - its memset !
  2002-04-23  2:39       ` Jan Hubicka
@ 2002-04-23 13:36         ` Michel LESPINASSE
  2002-04-24  0:30           ` Jan Hubicka
  2002-04-24  3:32           ` Jan Hubicka
  0 siblings, 2 replies; 33+ messages in thread
From: Michel LESPINASSE @ 2002-04-23 13:36 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: Richard Henderson, gcc list

On Tue, Apr 23, 2002 at 11:25:40AM +0200, Jan Hubicka wrote:
> I guess the inlining threshold is too low or the default memset
> implementation too lame.  I was tunning it for Athlon, so the
> mileage may warry from CPU to CPU.  I will investigate the
> misscompilation first and check this second.

> Concerning the inlining, gcc inlines all memcpys with size smaller
> than 64 bytes. Perhaps this should be extended to 128 bytes in case
> we are still about 2 times as bad. This is partly due to lame
> implementation of memset in glibc too :(

When gcc does the inlining, performance seems to not be so bad. There
is probably still some untapped performance though, as some of the
initial and final alignment checks could be ommited when gcc already
knows about the alignment of the memory zone (like in my test case, it
was an array of shorts in the data segment, so it was known to be on a
two-byte boundary at least). But might be hard to code into gcc, I
dont know.

Also as I've been only giving bad news up to now, I wanted to say that
now that I've worked around the two issues I had with inlining and
with memset, the 3.1 snapshot does provide superior performance on my
libmpeg2 codebase, about 5% faster than 2.95.4, and that gets up to 8%
when using -fbranch-probabilities and 9% when using -mcpu=athlon-tbird
instead of the more generic -mcpu=pentiumpro. Nice work guys ! I am
still worried though, that other people will have the same trouble
with inlining as I did and not see all of the performance improvements
as a result.

Cheers,

-- 
Michel "Walken" LESPINASSE
Is this the best that god can do ? Then I'm not impressed.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: GCC performance regression - its memset !
  2002-04-23 13:36         ` Michel LESPINASSE
@ 2002-04-24  0:30           ` Jan Hubicka
  2002-04-24  0:50             ` Jakub Jelinek
  2002-04-24  3:32           ` Jan Hubicka
  1 sibling, 1 reply; 33+ messages in thread
From: Jan Hubicka @ 2002-04-24  0:30 UTC (permalink / raw)
  To: Michel LESPINASSE; +Cc: Jan Hubicka, Richard Henderson, gcc list

> On Tue, Apr 23, 2002 at 11:25:40AM +0200, Jan Hubicka wrote:
> > I guess the inlining threshold is too low or the default memset
> > implementation too lame.  I was tunning it for Athlon, so the
> > mileage may warry from CPU to CPU.  I will investigate the
> > misscompilation first and check this second.
> 
> > Concerning the inlining, gcc inlines all memcpys with size smaller
> > than 64 bytes. Perhaps this should be extended to 128 bytes in case
> > we are still about 2 times as bad. This is partly due to lame
> > implementation of memset in glibc too :(
> 
> When gcc does the inlining, performance seems to not be so bad. There
> is probably still some untapped performance though, as some of the
> initial and final alignment checks could be ommited when gcc already
> knows about the alignment of the memory zone (like in my test case, it

When it knows, it should avoid it.  Definitly on array of shorts, the alignment
to even byte is not done.  It is dificult to make it expect that array of
shorts is 4 byte aligned, as ABI does not specify this, so it may not be.
GCC has new alignment tracking code, so it should be better than any previous
version, but still not that good.  (for instnace when array is static, it
definitly do have chance to conclude so, but it does not, however majority of
string functions come to computed addresses)

> was an array of shorts in the data segment, so it was known to be on a
> two-byte boundary at least). But might be hard to code into gcc, I
> dont know.
> 
> Also as I've been only giving bad news up to now, I wanted to say that
> now that I've worked around the two issues I had with inlining and
> with memset, the 3.1 snapshot does provide superior performance on my
> libmpeg2 codebase, about 5% faster than 2.95.4, and that gets up to 8%
> when using -fbranch-probabilities and 9% when using -mcpu=athlon-tbird

That sounds good :)
> instead of the more generic -mcpu=pentiumpro. Nice work guys ! I am
> still worried though, that other people will have the same trouble
> with inlining as I did and not see all of the performance improvements
> as a result.

I will send patch to increase the constant to 128.  I was re-benchmarking
the code and on P4/Athlon and my assembly memset, the 64 is just on the border
(ie inlined/not inlined sollution have less than 10% difference), setting
it to 128 does not make us to loose something.  For glibc implementation
128 is still a win to be inlined :(

Honza
> 
> Cheers,
> 
> -- 
> Michel "Walken" LESPINASSE
> Is this the best that god can do ? Then I'm not impressed.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: GCC performance regression - its memset !
  2002-04-24  0:30           ` Jan Hubicka
@ 2002-04-24  0:50             ` Jakub Jelinek
  2002-04-24  1:00               ` Jan Hubicka
  0 siblings, 1 reply; 33+ messages in thread
From: Jakub Jelinek @ 2002-04-24  0:50 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: Michel LESPINASSE, Richard Henderson, gcc list

On Wed, Apr 24, 2002 at 09:17:48AM +0200, Jan Hubicka wrote:
> > instead of the more generic -mcpu=pentiumpro. Nice work guys ! I am
> > still worried though, that other people will have the same trouble
> > with inlining as I did and not see all of the performance improvements
> > as a result.
> 
> I will send patch to increase the constant to 128.  I was re-benchmarking
> the code and on P4/Athlon and my assembly memset, the 64 is just on the border
> (ie inlined/not inlined sollution have less than 10% difference), setting
> it to 128 does not make us to loose something.  For glibc implementation
> 128 is still a win to be inlined :(

So please contribute it to glibc then...
Is yours an Athlon optimized memset or does it perform better on P3/P4 too?

	Jakub

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: GCC performance regression - its memset !
  2002-04-24  0:50             ` Jakub Jelinek
@ 2002-04-24  1:00               ` Jan Hubicka
  0 siblings, 0 replies; 33+ messages in thread
From: Jan Hubicka @ 2002-04-24  1:00 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: Jan Hubicka, Michel LESPINASSE, Richard Henderson, gcc list

> On Wed, Apr 24, 2002 at 09:17:48AM +0200, Jan Hubicka wrote:
> > > instead of the more generic -mcpu=pentiumpro. Nice work guys ! I am
> > > still worried though, that other people will have the same trouble
> > > with inlining as I did and not see all of the performance improvements
> > > as a result.
> > 
> > I will send patch to increase the constant to 128.  I was re-benchmarking
> > the code and on P4/Athlon and my assembly memset, the 64 is just on the border
> > (ie inlined/not inlined sollution have less than 10% difference), setting
> > it to 128 does not make us to loose something.  For glibc implementation
> > 128 is still a win to be inlined :(
> 
> So please contribute it to glibc then...
> Is yours an Athlon optimized memset or does it perform better on P3/P4 too?

It is fairly simple memset (for small cases only what matter is whether memset
do have fast path trought), but I do have Athlon optimized as well.
The problem with glibc is that it needs machinery for Athlon specific stuff
and AFAIK glibc maintainers didn't decided how to do that.

Honza
> 
> 	Jakub

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: GCC performance regression - its memset !
  2002-04-23 13:36         ` Michel LESPINASSE
  2002-04-24  0:30           ` Jan Hubicka
@ 2002-04-24  3:32           ` Jan Hubicka
  1 sibling, 0 replies; 33+ messages in thread
From: Jan Hubicka @ 2002-04-24  3:32 UTC (permalink / raw)
  To: Michel LESPINASSE; +Cc: Jan Hubicka, Richard Henderson, gcc list

> Also as I've been only giving bad news up to now, I wanted to say that
> now that I've worked around the two issues I had with inlining and
> with memset, the 3.1 snapshot does provide superior performance on my
> libmpeg2 codebase, about 5% faster than 2.95.4, and that gets up to 8%
> when using -fbranch-probabilities and 9% when using -mcpu=athlon-tbird
> instead of the more generic -mcpu=pentiumpro. Nice work guys ! I am

Just note that -mcpu=pentiumpro is not generic at all.  It optimizes
for pentiumpro and (in turn P2/P3), but nothing else.
Perhaps we can have switch to generate code working well on the modern CPUs,
but we don't

Honza

> still worried though, that other people will have the same trouble
> with inlining as I did and not see all of the performance improvements
> as a result.
> 
> Cheers,
> 
> -- 
> Michel "Walken" LESPINASSE
> Is this the best that god can do ? Then I'm not impressed.

^ permalink raw reply	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2002-05-21 17:21 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2002-04-22 23:07 GCC performance regression - its memset! Roger Sayle
2002-04-22 23:30 ` Michel LESPINASSE
2002-04-23  0:45   ` Michel LESPINASSE
2002-04-23  2:53   ` Jan Hubicka
2002-04-23  3:28     ` Richard Henderson
2002-05-20  8:06     ` [3.1.1] " Jan Hubicka
2002-05-20  9:36       ` Glen Nakamura
2002-05-20 10:08       ` Richard Henderson
2002-05-20 10:38       ` Jakub Jelinek
2002-05-20 11:28         ` Roger Sayle
2002-05-20 12:57           ` Glen Nakamura
2002-05-20 16:58             ` Roger Sayle
2002-05-21  8:23             ` Jack Lloyd
2002-05-21  9:55               ` Glen Nakamura
2002-05-21 11:13                 ` Jack Lloyd
2002-05-21  8:04         ` Jan Hubicka
2002-05-20 12:07       ` Mark Mitchell
  -- strict thread matches above, loose matches on Subject: below --
2002-04-20 18:13 GCC performance regression - up to 20% ? Michel LESPINASSE
2002-04-22 14:33 ` GCC performance regression - its memset ! Michel LESPINASSE
2002-04-22 14:58   ` Jason R Thorpe
2002-04-22 15:27     ` Michel LESPINASSE
2002-04-22 16:59     ` Segher Boessenkool
2002-04-22 17:10   ` Richard Henderson
2002-04-22 17:13     ` Michel LESPINASSE
2002-04-22 17:39       ` Richard Henderson
2002-04-22 17:49         ` Michel LESPINASSE
2002-04-23  5:03           ` Falk Hueffner
2002-04-23  6:53             ` Andreas Schwab
2002-04-23  2:39       ` Jan Hubicka
2002-04-23 13:36         ` Michel LESPINASSE
2002-04-24  0:30           ` Jan Hubicka
2002-04-24  0:50             ` Jakub Jelinek
2002-04-24  1:00               ` Jan Hubicka
2002-04-24  3:32           ` Jan Hubicka

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).