* Re: GCC performance regression - its memset! @ 2002-04-22 23:07 Roger Sayle 2002-04-22 23:30 ` Michel LESPINASSE 0 siblings, 1 reply; 33+ messages in thread From: Roger Sayle @ 2002-04-22 23:07 UTC (permalink / raw) To: gcc, Michel Lespinasse, Richard Henderson, Jan Hubicka I think its one of Jan's changes. I can reproduce the problem, and fix it using "-minline-all-stringops" which forces 3.1 to inline the memset on i686. I was concerned that it was a middle-end bug with builtins, but it now appears to be an ia32 back-end issue. Michel, does "-minline-all-stringops" fix the problem for you? Roger -- Roger Sayle, E-mail: roger@eyesopen.com OpenEye Scientific Software, WWW: http://www.eyesopen.com/ Suite 1107, 3600 Cerrillos Road, Tel: (+1) 505-473-7385 Santa Fe, New Mexico, 87507. Fax: (+1) 505-473-0833 ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: GCC performance regression - its memset! 2002-04-22 23:07 GCC performance regression - its memset! Roger Sayle @ 2002-04-22 23:30 ` Michel LESPINASSE 2002-04-23 0:45 ` Michel LESPINASSE 2002-04-23 2:53 ` Jan Hubicka 0 siblings, 2 replies; 33+ messages in thread From: Michel LESPINASSE @ 2002-04-22 23:30 UTC (permalink / raw) To: Roger Sayle; +Cc: gcc, Richard Henderson, Jan Hubicka On Mon, Apr 22, 2002 at 11:13:09PM -0600, Roger Sayle wrote: > > I think its one of Jan's changes. I can reproduce the problem, and > fix it using "-minline-all-stringops" which forces 3.1 to inline the > memset on i686. I was concerned that it was a middle-end bug with > builtins, but it now appears to be an ia32 back-end issue. > > Michel, does "-minline-all-stringops" fix the problem for you? This option actually generates invalid code for me. Here is a test case: ------------------- cut here ----------------- #include <string.h> short table[64]; int main (void) { int i; for (i = 0; i < 64; i++) table[i] = 1234; memset (table, 0, 63 * sizeof(short)); return (table[63] != 0); } ------------------- cut here ----------------- This code should return 0, however it returns 1 (compiled with -O3 -minline-all-stringops) Here is an extract from the generated asm (the memset part of it): movl $table, %edi testl $1, %edi <- test 1-byte alignment (hmmm, isnt table already two-byte aligned, being a short ?) movl $126, %eax <- we want to clear 126 bytes je .L7 movb $0, table movl $table+1, %edi <- now edi is guaranteed two-byte-aligned movl $125, %eax .L7: testl $2, %edi <- test 4-byte alignment je .L8 movw $0, (%edi) subl $2, %eax <- now edi is guaranteed four-byte-aligned addl $2, %edi .L8: cld movl %eax, %ecx xorl %eax, %eax shrl $2, %ecx <- number of 4-byte words remaining rep stosl testl $2, %edi <- ooops, its really meant to test the remainder not the address !!! so test will always fail. je .L9 movw $0, (%edi) addl $2, %edi .L9: testl $1, %edi <- that one too. je .L10 movb $0, (%edi) .L10: 2.95 was generating simpler code: movl $table,%edi xorl %eax,%eax cld movl $31,%ecx rep stosl stosw This did not take care about alignment issues, but was simpler and actually faster on my athlon. Hope this helps, -- Michel "Walken" LESPINASSE Is this the best that god can do ? Then I'm not impressed. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: GCC performance regression - its memset! 2002-04-22 23:30 ` Michel LESPINASSE @ 2002-04-23 0:45 ` Michel LESPINASSE 2002-04-23 2:53 ` Jan Hubicka 1 sibling, 0 replies; 33+ messages in thread From: Michel LESPINASSE @ 2002-04-23 0:45 UTC (permalink / raw) To: Roger Sayle; +Cc: gcc, Richard Henderson, Jan Hubicka My test case was actually wrong - the final test should be with table[62] not table[63]. But, with this correction, it does exhibit the bug I was mentionning. Sorry for the mistake :/ On Mon, Apr 22, 2002 at 11:07:09PM -0700, Michel LESPINASSE wrote: > This option actually generates invalid code for me. Here is a test case: > ------------------- cut here ----------------- > #include <string.h> > > short table[64]; > > int main (void) > { > int i; > > for (i = 0; i < 64; i++) > table[i] = 1234; > > memset (table, 0, 63 * sizeof(short)); > > return (table[63] != 0); > } > ------------------- cut here ----------------- > This code should return 0, however it returns 1 (compiled with -O3 > -minline-all-stringops) -- Michel "Walken" LESPINASSE Is this the best that god can do ? Then I'm not impressed. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: GCC performance regression - its memset! 2002-04-22 23:30 ` Michel LESPINASSE 2002-04-23 0:45 ` Michel LESPINASSE @ 2002-04-23 2:53 ` Jan Hubicka 2002-04-23 3:28 ` Richard Henderson 2002-05-20 8:06 ` [3.1.1] " Jan Hubicka 1 sibling, 2 replies; 33+ messages in thread From: Jan Hubicka @ 2002-04-23 2:53 UTC (permalink / raw) To: Michel LESPINASSE Cc: Roger Sayle, gcc, Richard Henderson, Jan Hubicka, gcc-patches > rep > stosl > testl $2, %edi <- ooops, its really meant to test the remainder > not the address !!! so test will always fail. > je .L9 > movw $0, (%edi) > addl $2, %edi > .L9: > testl $1, %edi <- that one too. > je .L10 > movb $0, (%edi) > .L10: Hmm, an pasto. In memcpy case I got it right, while in memset I broke it. I am attaching patch I am testing currently. OK for mainline/branch assuming it passes? COncerning the inlining, gcc inlines all memcpys with size smaller than 64 bytes. Perhaps this should be extended to 128 bytes in case we are still about 2 times as bad. This is partly due to lame implementation of memset in glibc too :( Tue Apr 23 11:48:53 CEST 2002 Jan HUbicka <jh@suse.cz> * i386.c (ix86_expand_clrstr): Fix pasto. Index: i386.c =================================================================== RCS file: /cvs/gcc/egcs/gcc/config/i386/i386.c,v retrieving revision 1.353.2.14 diff -c -3 -p -r1.353.2.14 i386.c *** i386.c 16 Apr 2002 18:16:36 -0000 1.353.2.14 --- i386.c 23 Apr 2002 09:47:50 -0000 *************** ix86_expand_clrstr (src, count_exp, alig *** 9451,9457 **** gen_rtx_SUBREG (SImode, zeroreg, 0))); if (TARGET_64BIT && (align <= 4 || count == 0)) { ! rtx label = ix86_expand_aligntest (destreg, 2); emit_insn (gen_strsetsi (destreg, gen_rtx_SUBREG (SImode, zeroreg, 0))); emit_label (label); --- 9451,9457 ---- gen_rtx_SUBREG (SImode, zeroreg, 0))); if (TARGET_64BIT && (align <= 4 || count == 0)) { ! rtx label = ix86_expand_aligntest (countreg, 2); emit_insn (gen_strsetsi (destreg, gen_rtx_SUBREG (SImode, zeroreg, 0))); emit_label (label); *************** ix86_expand_clrstr (src, count_exp, alig *** 9462,9468 **** gen_rtx_SUBREG (HImode, zeroreg, 0))); if (align <= 2 || count == 0) { ! rtx label = ix86_expand_aligntest (destreg, 2); emit_insn (gen_strsethi (destreg, gen_rtx_SUBREG (HImode, zeroreg, 0))); emit_label (label); --- 9462,9468 ---- gen_rtx_SUBREG (HImode, zeroreg, 0))); if (align <= 2 || count == 0) { ! rtx label = ix86_expand_aligntest (countreg, 2); emit_insn (gen_strsethi (destreg, gen_rtx_SUBREG (HImode, zeroreg, 0))); emit_label (label); *************** ix86_expand_clrstr (src, count_exp, alig *** 9473,9479 **** gen_rtx_SUBREG (QImode, zeroreg, 0))); if (align <= 1 || count == 0) { ! rtx label = ix86_expand_aligntest (destreg, 1); emit_insn (gen_strsetqi (destreg, gen_rtx_SUBREG (QImode, zeroreg, 0))); emit_label (label); --- 9473,9479 ---- gen_rtx_SUBREG (QImode, zeroreg, 0))); if (align <= 1 || count == 0) { ! rtx label = ix86_expand_aligntest (countreg, 1); emit_insn (gen_strsetqi (destreg, gen_rtx_SUBREG (QImode, zeroreg, 0))); emit_label (label); ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: GCC performance regression - its memset! 2002-04-23 2:53 ` Jan Hubicka @ 2002-04-23 3:28 ` Richard Henderson 2002-05-20 8:06 ` [3.1.1] " Jan Hubicka 1 sibling, 0 replies; 33+ messages in thread From: Richard Henderson @ 2002-04-23 3:28 UTC (permalink / raw) To: Jan Hubicka; +Cc: Michel LESPINASSE, Roger Sayle, gcc, gcc-patches On Tue, Apr 23, 2002 at 11:51:45AM +0200, Jan Hubicka wrote: > * i386.c (ix86_expand_clrstr): Fix pasto. Ok. r~ ^ permalink raw reply [flat|nested] 33+ messages in thread
* [3.1.1] Re: GCC performance regression - its memset! 2002-04-23 2:53 ` Jan Hubicka 2002-04-23 3:28 ` Richard Henderson @ 2002-05-20 8:06 ` Jan Hubicka 2002-05-20 9:36 ` Glen Nakamura ` (3 more replies) 1 sibling, 4 replies; 33+ messages in thread From: Jan Hubicka @ 2002-05-20 8:06 UTC (permalink / raw) To: Jan Hubicka Cc: Michel LESPINASSE, Roger Sayle, gcc, Richard Henderson, gcc-patches, mark > > Hmm, an pasto. > In memcpy case I got it right, while in memset I broke it. I am attaching patch > I am testing currently. OK for mainline/branch assuming it passes? > > COncerning the inlining, gcc inlines all memcpys with size smaller than 64 bytes. > Perhaps this should be extended to 128 bytes in case we are still about 2 times as bad. > This is partly due to lame implementation of memset in glibc too :( Mark, Would this patch be OK for 3.1.1 branch? It fixes serious misscompilation. Not really regression, since extra switch is needed, but that switch seems to be popular. > > > Tue Apr 23 11:48:53 CEST 2002 Jan HUbicka <jh@suse.cz> > * i386.c (ix86_expand_clrstr): Fix pasto. > Index: i386.c > =================================================================== > RCS file: /cvs/gcc/egcs/gcc/config/i386/i386.c,v > retrieving revision 1.353.2.14 > diff -c -3 -p -r1.353.2.14 i386.c > *** i386.c 16 Apr 2002 18:16:36 -0000 1.353.2.14 > --- i386.c 23 Apr 2002 09:47:50 -0000 > *************** ix86_expand_clrstr (src, count_exp, alig > *** 9451,9457 **** > gen_rtx_SUBREG (SImode, zeroreg, 0))); > if (TARGET_64BIT && (align <= 4 || count == 0)) > { > ! rtx label = ix86_expand_aligntest (destreg, 2); > emit_insn (gen_strsetsi (destreg, > gen_rtx_SUBREG (SImode, zeroreg, 0))); > emit_label (label); > --- 9451,9457 ---- > gen_rtx_SUBREG (SImode, zeroreg, 0))); > if (TARGET_64BIT && (align <= 4 || count == 0)) > { > ! rtx label = ix86_expand_aligntest (countreg, 2); > emit_insn (gen_strsetsi (destreg, > gen_rtx_SUBREG (SImode, zeroreg, 0))); > emit_label (label); > *************** ix86_expand_clrstr (src, count_exp, alig > *** 9462,9468 **** > gen_rtx_SUBREG (HImode, zeroreg, 0))); > if (align <= 2 || count == 0) > { > ! rtx label = ix86_expand_aligntest (destreg, 2); > emit_insn (gen_strsethi (destreg, > gen_rtx_SUBREG (HImode, zeroreg, 0))); > emit_label (label); > --- 9462,9468 ---- > gen_rtx_SUBREG (HImode, zeroreg, 0))); > if (align <= 2 || count == 0) > { > ! rtx label = ix86_expand_aligntest (countreg, 2); > emit_insn (gen_strsethi (destreg, > gen_rtx_SUBREG (HImode, zeroreg, 0))); > emit_label (label); > *************** ix86_expand_clrstr (src, count_exp, alig > *** 9473,9479 **** > gen_rtx_SUBREG (QImode, zeroreg, 0))); > if (align <= 1 || count == 0) > { > ! rtx label = ix86_expand_aligntest (destreg, 1); > emit_insn (gen_strsetqi (destreg, > gen_rtx_SUBREG (QImode, zeroreg, 0))); > emit_label (label); > --- 9473,9479 ---- > gen_rtx_SUBREG (QImode, zeroreg, 0))); > if (align <= 1 || count == 0) > { > ! rtx label = ix86_expand_aligntest (countreg, 1); > emit_insn (gen_strsetqi (destreg, > gen_rtx_SUBREG (QImode, zeroreg, 0))); > emit_label (label); ^ permalink raw reply [flat|nested] 33+ messages in thread
* [3.1.1] Re: GCC performance regression - its memset! 2002-05-20 8:06 ` [3.1.1] " Jan Hubicka @ 2002-05-20 9:36 ` Glen Nakamura 2002-05-20 10:08 ` Richard Henderson ` (2 subsequent siblings) 3 siblings, 0 replies; 33+ messages in thread From: Glen Nakamura @ 2002-05-20 9:36 UTC (permalink / raw) To: jh; +Cc: walken, roger, gcc, rth, gcc-patches, mark > Tue Apr 23 11:48:53 CEST 2002 Jan HUbicka <jh@suse.cz> > * i386.c (ix86_expand_clrstr): Fix pasto. I believe this patch will fix optimization/6703: http://gcc.gnu.org/cgi-bin/gnatsweb.pl?cmd=view%20audit-trail&database=gcc&pr=6703 It also fixes a miscompilation of mozilla bug #145267: http://bugzilla.mozilla.org/show_bug.cgi?id=145267 BTW, I had a few more testsuite failures when I added this patch. I'm not sure if it's related though... anyone else get these errors? - Glen Nakamura Test Run By glen on Sun May 19 21:00:16 2002 Native configuration is i586-pc-linux-gnu === gcc tests === Schedule of variations: unix Running target unix FAIL: gcc.c-torture/execute/20020402-1.c execution, -Os FAIL: gcc.c-torture/execute/loop-2e.c execution, -Os FAIL: gcc.c-torture/execute/loop-3c.c execution, -Os FAIL: gcc.dg/format/c90-scanf-1.c writing into NULL (test for warnings, line 118) FAIL: gcc.dg/format/c90-scanf-1.c (test for excess errors) FAIL: gcc.dg/format/c99-printf-1.c %hhn unsigned char (test for warnings, line 196) FAIL: gcc.dg/format/c99-printf-1.c (test for excess errors) FAIL: gcc.dg/format/ext-3.c bad %-_ (test for warnings, line 211) FAIL: gcc.dg/format/ext-3.c bad %-0 (test for warnings, line 212) FAIL: gcc.dg/format/ext-3.c bad %_0 (test for warnings, line 213) FAIL: gcc.dg/format/ext-3.c bad %^# (test for warnings, line 215) FAIL: gcc.dg/format/ext-3.c (test for excess errors) ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [3.1.1] Re: GCC performance regression - its memset! 2002-05-20 8:06 ` [3.1.1] " Jan Hubicka 2002-05-20 9:36 ` Glen Nakamura @ 2002-05-20 10:08 ` Richard Henderson 2002-05-20 10:38 ` Jakub Jelinek 2002-05-20 12:07 ` Mark Mitchell 3 siblings, 0 replies; 33+ messages in thread From: Richard Henderson @ 2002-05-20 10:08 UTC (permalink / raw) To: Jan Hubicka; +Cc: Michel LESPINASSE, Roger Sayle, gcc, gcc-patches, mark On Mon, May 20, 2002 at 04:48:39PM +0200, Jan Hubicka wrote: > > * i386.c (ix86_expand_clrstr): Fix pasto. Ok for branch. r~ ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [3.1.1] Re: GCC performance regression - its memset! 2002-05-20 8:06 ` [3.1.1] " Jan Hubicka 2002-05-20 9:36 ` Glen Nakamura 2002-05-20 10:08 ` Richard Henderson @ 2002-05-20 10:38 ` Jakub Jelinek 2002-05-20 11:28 ` Roger Sayle 2002-05-21 8:04 ` Jan Hubicka 2002-05-20 12:07 ` Mark Mitchell 3 siblings, 2 replies; 33+ messages in thread From: Jakub Jelinek @ 2002-05-20 10:38 UTC (permalink / raw) To: Jan Hubicka Cc: Michel LESPINASSE, Roger Sayle, gcc, Richard Henderson, gcc-patches, mark On Mon, May 20, 2002 at 04:48:39PM +0200, Jan Hubicka wrote: > > > > Hmm, an pasto. > > In memcpy case I got it right, while in memset I broke it. I am attaching patch > > I am testing currently. OK for mainline/branch assuming it passes? > > > > COncerning the inlining, gcc inlines all memcpys with size smaller than 64 bytes. > > Perhaps this should be extended to 128 bytes in case we are still about 2 times as bad. > > This is partly due to lame implementation of memset in glibc too :( > > Mark, > Would this patch be OK for 3.1.1 branch? It fixes serious misscompilation. > Not really regression, since extra switch is needed, but that switch seems > to be popular. Do you consider -O or -O2 a special switch? Anyway, could the testcase be commited with the patch too (the bug shows up in C too BTW)? --- gcc/testsuite/gcc.c-torture/execute/20020520-1.c.jj Thu Aug 30 22:30:55 2001 +++ gcc/testsuite/gcc.c-torture/execute/20020520-1.c Mon May 20 19:00:41 2002 @@ -0,0 +1,20 @@ +/* PR optimization/6703 */ +extern void abort (void); +extern void exit (int); + +void foo (void **x, int y) +{ + __builtin_memset (x, 0, y); +} + +int main () +{ + unsigned char x[8] __attribute__ ((aligned (4))); + int i; + + __builtin_memset (x, 0x5a, sizeof (x)); + foo ((void **) (x + 3), 1); + for (i = 0; i < 8; i++) + if (x[i] != (i != 3 ? 0x5a : 0)) + abort (); + exit (0); +} > > > > > > Tue Apr 23 11:48:53 CEST 2002 Jan HUbicka <jh@suse.cz> > > * i386.c (ix86_expand_clrstr): Fix pasto. Jakub ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [3.1.1] Re: GCC performance regression - its memset! 2002-05-20 10:38 ` Jakub Jelinek @ 2002-05-20 11:28 ` Roger Sayle 2002-05-20 12:57 ` Glen Nakamura 2002-05-21 8:04 ` Jan Hubicka 1 sibling, 1 reply; 33+ messages in thread From: Roger Sayle @ 2002-05-20 11:28 UTC (permalink / raw) To: Jakub Jelinek; +Cc: Jan Hubicka, gcc-patches, gcc > Do you consider -O or -O2 a special switch? > Anyway, could the testcase be commited with the patch too (the bug shows > up in C too BTW)? > > --- gcc/testsuite/gcc.c-torture/execute/20020520-1.c.jj Thu Aug 30 22:30:55 2001 > +++ gcc/testsuite/gcc.c-torture/execute/20020520-1.c Mon May 20 19:00:41 2002 > @@ -0,0 +1,20 @@ > +/* PR optimization/6703 */ > +extern void abort (void); > +extern void exit (int); > + > +void foo (void **x, int y) > +{ > + __builtin_memset (x, 0, y); > +} > + > +int main () > +{ > + unsigned char x[8] __attribute__ ((aligned (4))); > + int i; > + > + __builtin_memset (x, 0x5a, sizeof (x)); > + foo ((void **) (x + 3), 1); > + for (i = 0; i < 8; i++) > + if (x[i] != (i != 3 ? 0x5a : 0)) > + abort (); > + exit (0); > +} Hi Jakub, As I've mentioned in a previous e-mail, I believe that this test case is poorly formed; quite clearly the x argument to foo is incorrectly aligned for the type. On many platforms, the above code (and even with foo defined as "*x = 0;") will fail at runtime. The benefit of Jan's patch is that on machines without alignment restrictions, we'll still produce the "expected behaviour" for incorrectly written programs. Mozilla's source code is incorrect and needs to be fixed (use void* rather than void**). This isn't really a correctness PR, more a quality of implementation for undefined behaviour PR. Roger -- ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [3.1.1] Re: GCC performance regression - its memset! 2002-05-20 11:28 ` Roger Sayle @ 2002-05-20 12:57 ` Glen Nakamura 2002-05-20 16:58 ` Roger Sayle 2002-05-21 8:23 ` Jack Lloyd 0 siblings, 2 replies; 33+ messages in thread From: Glen Nakamura @ 2002-05-20 12:57 UTC (permalink / raw) To: roger; +Cc: jakub, jh, gcc > Hi Jakub, > > As I've mentioned in a previous e-mail, I believe that this test > case is poorly formed; quite clearly the x argument to foo is > incorrectly aligned for the type. On many platforms, the above > code (and even with foo defined as "*x = 0;") will fail at runtime. > > The benefit of Jan's patch is that on machines without alignment > restrictions, we'll still produce the "expected behaviour" for > incorrectly written programs. Mozilla's source code is incorrect > and needs to be fixed (use void* rather than void**). This isn't > really a correctness PR, more a quality of implementation for > undefined behaviour PR. > > Roger Aloha, FWIW, I think this really is a regression... The problem is really that memset doesn't properly set bytes at the tail of the buffer. How about this testcase? - Glen Nakamura /* PR optimization/6703 */ extern void abort (void); extern void exit (int); void foo (int *x, int y) { __builtin_memset (x, 0, y); } int main () { int x[2] = { 0x5a5a5a5a, 0x5a5a5a5a }; if (x[1] != 0x5a5a5a5a) abort (); foo (x, 5); if (x[1] == 0x5a5a5a5a) abort (); exit (0); } ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [3.1.1] Re: GCC performance regression - its memset! 2002-05-20 12:57 ` Glen Nakamura @ 2002-05-20 16:58 ` Roger Sayle 2002-05-21 8:23 ` Jack Lloyd 1 sibling, 0 replies; 33+ messages in thread From: Roger Sayle @ 2002-05-20 16:58 UTC (permalink / raw) To: Glen Nakamura; +Cc: jakub, jh, gcc Hi Glen, > FWIW, I think this really is a regression... The problem is really > that memset doesn't properly set bytes at the tail of the buffer. My apologies. You're right it really is a serious regression! > How about this testcase? The other possibility is to copy the memset-1.c and memset-2.c tests from CVS mainline to the branch's gcc.c-torture/execute. These "correctness" tests would also have uncovered the regression. Roger -- ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [3.1.1] Re: GCC performance regression - its memset! 2002-05-20 12:57 ` Glen Nakamura 2002-05-20 16:58 ` Roger Sayle @ 2002-05-21 8:23 ` Jack Lloyd 2002-05-21 9:55 ` Glen Nakamura 1 sibling, 1 reply; 33+ messages in thread From: Jack Lloyd @ 2002-05-21 8:23 UTC (permalink / raw) To: gcc This looks to be an endian issue. Some code: ------------- CUT ---------------- #include <stdio.h> void foo(void *x, int y) { __builtin_memset(x, 0, y); } int main() { int x[2] = { 0x5a5a5a5a, 0x5a5a5a5a }; printf("%08X %08X\n", x[0], x[1]); foo(x, 5); printf("%08X %08X\n", x[0], x[1]); return 0; } ------------- CUT ---------------- Let's try this on x86: 5A5A5A5A 5A5A5A5A 00000000 5A5A5A00 And again on SPARC: 5A5A5A5A 5A5A5A5A 00000000 005A5A5A where it comes out what you expect, at least if you expect what I expect (or something...) And, honestly, I'm not sure if this is how it's supposed to be or not. What are the rules for doing a memset across part of an object? Certainly, in both cases it sets 5 bytes to zero, and those bytes are contiguous in memory. The thing is, I can't imagine how you could actually change this on x86 to work like it does on a big-endian machine, because you would have to know the size of the type involved. HTH, -Jack On Mon, 20 May 2002, Glen Nakamura wrote: > > Hi Jakub, > > > > As I've mentioned in a previous e-mail, I believe that this test > > case is poorly formed; quite clearly the x argument to foo is > > incorrectly aligned for the type. On many platforms, the above > > code (and even with foo defined as "*x = 0;") will fail at runtime. > > > > The benefit of Jan's patch is that on machines without alignment > > restrictions, we'll still produce the "expected behaviour" for > > incorrectly written programs. Mozilla's source code is incorrect > > and needs to be fixed (use void* rather than void**). This isn't > > really a correctness PR, more a quality of implementation for > > undefined behaviour PR. > > > > Roger > > Aloha, > > FWIW, I think this really is a regression... The problem is really > that memset doesn't properly set bytes at the tail of the buffer. > How about this testcase? > > - Glen Nakamura > > > /* PR optimization/6703 */ > extern void abort (void); > extern void exit (int); > > void foo (int *x, int y) > { > __builtin_memset (x, 0, y); > } > > int main () > { > int x[2] = { 0x5a5a5a5a, 0x5a5a5a5a }; > > if (x[1] != 0x5a5a5a5a) > abort (); > foo (x, 5); > if (x[1] == 0x5a5a5a5a) > abort (); > exit (0); > } > ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [3.1.1] Re: GCC performance regression - its memset! 2002-05-21 8:23 ` Jack Lloyd @ 2002-05-21 9:55 ` Glen Nakamura 2002-05-21 11:13 ` Jack Lloyd 0 siblings, 1 reply; 33+ messages in thread From: Glen Nakamura @ 2002-05-21 9:55 UTC (permalink / raw) To: lloyd; +Cc: gcc > This looks to be an endian issue. > > Some code: > ------------- CUT ---------------- > #include <stdio.h> > > void foo(void *x, int y) > { > __builtin_memset(x, 0, y); > } > > int main() > { > int x[2] = { 0x5a5a5a5a, 0x5a5a5a5a }; > printf("%08X %08X\n", x[0], x[1]); > foo(x, 5); > printf("%08X %08X\n", x[0], x[1]); > return 0; > } > ------------- CUT ---------------- > > Let's try this on x86: > > 5A5A5A5A 5A5A5A5A > 00000000 5A5A5A00 > > And again on SPARC: > > 5A5A5A5A 5A5A5A5A > 00000000 005A5A5A > > where it comes out what you expect, at least if you expect what I expect > (or something...) > > And, honestly, I'm not sure if this is how it's supposed to be or not. > What are the rules for doing a memset across part of an object? Certainly, > in both cases it sets 5 bytes to zero, and those bytes are contiguous in > memory. The thing is, I can't imagine how you could actually change this on > x86 to work like it does on a big-endian machine, because you would have to > know the size of the type involved. > > HTH, > -Jack > Actually, the failure occurs with gcc-3.1 on x86 when compiling with -O2. I would expect the following output from your example: 5A5A5A5A 5A5A5A5A 00000000 5A5A5A5A Only the first 4 bytes are set... - Glen Nakamura ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [3.1.1] Re: GCC performance regression - its memset! 2002-05-21 9:55 ` Glen Nakamura @ 2002-05-21 11:13 ` Jack Lloyd 0 siblings, 0 replies; 33+ messages in thread From: Jack Lloyd @ 2002-05-21 11:13 UTC (permalink / raw) To: Glen Nakamura; +Cc: gcc On Tue, 21 May 2002, Glen Nakamura wrote: Ah, OK. My mistake (no optimizations used on mine). -J > Actually, the failure occurs with gcc-3.1 on x86 when compiling with -O2. > I would expect the following output from your example: > > 5A5A5A5A 5A5A5A5A > 00000000 5A5A5A5A > > Only the first 4 bytes are set... > > - Glen Nakamura > ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [3.1.1] Re: GCC performance regression - its memset! 2002-05-20 10:38 ` Jakub Jelinek 2002-05-20 11:28 ` Roger Sayle @ 2002-05-21 8:04 ` Jan Hubicka 1 sibling, 0 replies; 33+ messages in thread From: Jan Hubicka @ 2002-05-21 8:04 UTC (permalink / raw) To: Jakub Jelinek Cc: Jan Hubicka, Michel LESPINASSE, Roger Sayle, gcc, Richard Henderson, gcc-patches, mark > > Mark, > > Would this patch be OK for 3.1.1 branch? It fixes serious misscompilation. > > Not really regression, since extra switch is needed, but that switch seems > > to be popular. > > Do you consider -O or -O2 a special switch? > Anyway, could the testcase be commited with the patch too (the bug shows > up in C too BTW)? I meant -minline-all-stringops and didn't noticed this reproduce for some cases of default settings as well.. Honza ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [3.1.1] Re: GCC performance regression - its memset! 2002-05-20 8:06 ` [3.1.1] " Jan Hubicka ` (2 preceding siblings ...) 2002-05-20 10:38 ` Jakub Jelinek @ 2002-05-20 12:07 ` Mark Mitchell 3 siblings, 0 replies; 33+ messages in thread From: Mark Mitchell @ 2002-05-20 12:07 UTC (permalink / raw) To: Jan Hubicka Cc: Michel LESPINASSE, Roger Sayle, gcc, Richard Henderson, gcc-patches, mark --On Monday, May 20, 2002 04:48:39 PM +0200 Jan Hubicka <jh@suse.cz> wrote: >> >> Hmm, an pasto. >> In memcpy case I got it right, while in memset I broke it. I am >> attaching patch I am testing currently. OK for mainline/branch assuming >> it passes? >> >> COncerning the inlining, gcc inlines all memcpys with size smaller than >> 64 bytes. Perhaps this should be extended to 128 bytes in case we are >> still about 2 times as bad. This is partly due to lame implementation of >> memset in glibc too :( > > Mark, > Would this patch be OK for 3.1.1 branch? It fixes serious > misscompilation. Not really regression, since extra switch is needed, but > that switch seems to be popular. Which extra switch? I couldn't see it in the code, but I can never remember all the switches. :-) If this patch has been applied on the mainline and no problems have been reported, then it is OK for the branch. Thanks, -- Mark Mitchell mark@codesourcery.com CodeSourcery, LLC http://www.codesourcery.com ^ permalink raw reply [flat|nested] 33+ messages in thread
* GCC performance regression - up to 20% ? @ 2002-04-20 18:13 Michel LESPINASSE 2002-04-22 14:33 ` GCC performance regression - its memset ! Michel LESPINASSE 0 siblings, 1 reply; 33+ messages in thread From: Michel LESPINASSE @ 2002-04-20 18:13 UTC (permalink / raw) To: gcc list Hi, I have downloaded the latest 3.1 snapshot (20020415) and ran some performance tests. So far I've been impressed by the FP performance, but kinda disappointed by the integer performance. The benchmarks I've run are two libraries I maintain, libmpeg2 and liba52. These are used by several open-source dvd players, and are quite CPU intensive (especially libmpeg2). So here are my results, using gcc 2.95 as a reference: First the good news: liba52 (mostly FP intensive workload) on athlon tbird 950, using -mcpu=pentiumpro: gcc-3.0 is between 4.5% and 6.5% faster than 2.95.4 depending on streams gcc-3.1 snapshot is between 8% and 9.5% faster than 2.95.4 from these measurements 3.1 has a very nice performance, very close to intel's icc. Great work ! Also using -march=athlon-tbird and generating sse code, I can get yet a few extra % of performance. Now the bad news: for libmepg2, which is an integer-only workload, I get a 10% to 20% performance regression between 2.95.4 and 3.1... 3.0 was already slower than 2.95.4, but 3.1 seems to be worse for this workload at least. libmpeg2, on athlon tbird 950, with mmx optimizations: gcc-3.0 is about 2% slower than 2.95.4 gcc-3.1 snapshot is about 10% slower than 2.95.4 libmpeg2, on athlon tbird 950, using pure C code: gcc-3.0 is about 4.5% slower than 2.95.4 gcc-3.1 snapshot is about 5.5% slower than 2.95.4 libmpeg2, on celeron 366, with mmx optimizations: gcc-3.0 is about 4% slower than 2.95.4 gcc-3.1 snapshot is about 20.5% slower than 2.95.4 (!!!!) These results are all very repeatable. the celeron 366 results are the most worrying, as this processor already has borderline performance for decoding mpeg2 streams. Is there a known performance regression in current GCCs (say, do they get lower SPECint scores ?) or is it only with my code ? Also, is there anything I could do in my code to enhance performance with newer gcc versions ? One thing I noticed is that 3.1 snapshot produces less inlining than 3.0 or 2.95. This probably accounts for some of the slowdown I see when using mmx optimizations, as my mmx routines are written using a few routines that I really expect to get inlined. Is there any way I can get back control about that, so that gcc honours the inline keyword ? I have not managed to do this either. BTW, these two apps I mentionned can be found at http://libmpeg2.sourceforge.net/ http://liba52.sourceforge.net/ Puzzled, -- Michel "Walken" LESPINASSE Is this the best that god can do ? Then I'm not impressed. ^ permalink raw reply [flat|nested] 33+ messages in thread
* GCC performance regression - its memset ! 2002-04-20 18:13 GCC performance regression - up to 20% ? Michel LESPINASSE @ 2002-04-22 14:33 ` Michel LESPINASSE 2002-04-22 14:58 ` Jason R Thorpe 2002-04-22 17:10 ` Richard Henderson 0 siblings, 2 replies; 33+ messages in thread From: Michel LESPINASSE @ 2002-04-22 14:33 UTC (permalink / raw) To: gcc list OK, so I worked more to find the cause of the slowdown, and I figured out its all because of memset(). This function seems to be about twice slower than in 2.95, and also for some reason the time spent in memset does not show up in gprof. Here is a test case: --------------------------- foo.c ------------------------------ #include <string.h> short table[64]; void bar (void); int main (void) { int i; bar (); for (i = 0; i < 100000000; i++) memset (table + 1, 0, 63 * sizeof(short)); return 0; } ----------------------------- end of foo.c ------------------------ ----------------------------- bar.c ------------------------------- void bar (void) { } ----------------------------- end of bar.c ------------------------ # gcc-2.95 -g -O3 -p foo.c bar.c # time ./a.out ./a.out 5.75s user 0.00s system 100% cpu 5.739 total # gprof -bp ./a.out Flat profile: Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls Ts/call Ts/call name 100.00 5.74 5.74 main 0.00 5.74 0.00 1 0.00 0.00 bar # gcc-3.1 -g -O3 -p foo.c bar.c # time ./a.out ./a.out 10.78s user 0.00s system 101% cpu 10.634 total # gprof -bp ./a.out Flat profile: Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls Ts/call Ts/call name 100.00 0.62 0.62 main 0.00 0.62 0.00 1 0.00 0.00 bar gcc-3.1 snapshot is about twice slower than 2.95 on that test case, and for some reason the gprof output is bogus (it does not account for the time spent in memset), while it was not with 2.95. I did not know my code spent that much time in memset, I'll see what I can do about it. Hope this helps, -- Michel "Walken" LESPINASSE Is this the best that god can do ? Then I'm not impressed. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: GCC performance regression - its memset ! 2002-04-22 14:33 ` GCC performance regression - its memset ! Michel LESPINASSE @ 2002-04-22 14:58 ` Jason R Thorpe 2002-04-22 15:27 ` Michel LESPINASSE 2002-04-22 16:59 ` Segher Boessenkool 2002-04-22 17:10 ` Richard Henderson 1 sibling, 2 replies; 33+ messages in thread From: Jason R Thorpe @ 2002-04-22 14:58 UTC (permalink / raw) To: Michel LESPINASSE; +Cc: gcc list On Mon, Apr 22, 2002 at 02:32:22PM -0700, Michel LESPINASSE wrote: > gcc-3.1 snapshot is about twice slower than 2.95 on that test case, > and for some reason the gprof output is bogus (it does not account for > the time spent in memset), while it was not with 2.95. gprof doesn't see it because gcc is doing the memset inline, presumably; it does this in certain cases where it knows the size at compile time. Try running your test with -fno-builtin. (I'm not suggesting this as a fix for your performance issue, just as an explanation of why memset() is invisible to gprof in your testcase). -- -- Jason R. Thorpe <thorpej@wasabisystems.com> ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: GCC performance regression - its memset ! 2002-04-22 14:58 ` Jason R Thorpe @ 2002-04-22 15:27 ` Michel LESPINASSE 2002-04-22 16:59 ` Segher Boessenkool 1 sibling, 0 replies; 33+ messages in thread From: Michel LESPINASSE @ 2002-04-22 15:27 UTC (permalink / raw) To: Jason R Thorpe, gcc list On Mon, Apr 22, 2002 at 02:41:25PM -0700, Jason R Thorpe wrote: > gprof doesn't see it because gcc is doing the memset inline, presumably; it > does this in certain cases where it knows the size at compile time. I'm not surprised that there is no line for memset in the flat profile. But, I'm surprised that the cumulative time in the flat profile is not equal to the user time in the time command. 'normal' inlines (i.e. the ones I could write by declaring an inline function) are accounted normally by gprof (i.e. they do show up in the cumulative time). -- Michel "Walken" LESPINASSE Is this the best that god can do ? Then I'm not impressed. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: GCC performance regression - its memset ! 2002-04-22 14:58 ` Jason R Thorpe 2002-04-22 15:27 ` Michel LESPINASSE @ 2002-04-22 16:59 ` Segher Boessenkool 1 sibling, 0 replies; 33+ messages in thread From: Segher Boessenkool @ 2002-04-22 16:59 UTC (permalink / raw) To: thorpej; +Cc: Michel LESPINASSE, gcc list Jason R Thorpe wrote: > > On Mon, Apr 22, 2002 at 02:32:22PM -0700, Michel LESPINASSE wrote: > > > gcc-3.1 snapshot is about twice slower than 2.95 on that test case, > > and for some reason the gprof output is bogus (it does not account for > > the time spent in memset), while it was not with 2.95. > > gprof doesn't see it because gcc is doing the memset inline, presumably; it > does this in certain cases where it knows the size at compile time. > > Try running your test with -fno-builtin. (I'm not suggesting this as a > fix for your performance issue, just as an explanation of why memset() is > invisible to gprof in your testcase). Erm, no. In the 2.95 case, GCC _did_ inline the memset(); that's why it showed up as 5.74 seconds in main(). In the 3.1 case, it was not inlined; gprof doesn't show you the time spent in libc, as libc is not compiled with profiling enabled. Inlined functions are never profiled separately. Linking statically will make gprof show you the time spent in shared library functions (but not the call graph to those); or you can link against a libc that was compiled with profiling enabled, so you get a call graph as well. Segher ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: GCC performance regression - its memset ! 2002-04-22 14:33 ` GCC performance regression - its memset ! Michel LESPINASSE 2002-04-22 14:58 ` Jason R Thorpe @ 2002-04-22 17:10 ` Richard Henderson 2002-04-22 17:13 ` Michel LESPINASSE 1 sibling, 1 reply; 33+ messages in thread From: Richard Henderson @ 2002-04-22 17:10 UTC (permalink / raw) To: Michel LESPINASSE; +Cc: gcc list On Mon, Apr 22, 2002 at 02:32:22PM -0700, Michel LESPINASSE wrote: > OK, so I worked more to find the cause of the slowdown, and I figured > out its all because of memset(). See whether or not disabling glibc's inline expansion of memset affects 2.95 vs 3.x with -D__NO_STRING_INLINES. r~ ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: GCC performance regression - its memset ! 2002-04-22 17:10 ` Richard Henderson @ 2002-04-22 17:13 ` Michel LESPINASSE 2002-04-22 17:39 ` Richard Henderson 2002-04-23 2:39 ` Jan Hubicka 0 siblings, 2 replies; 33+ messages in thread From: Michel LESPINASSE @ 2002-04-22 17:13 UTC (permalink / raw) To: Richard Henderson, gcc list On Mon, Apr 22, 2002 at 04:59:53PM -0700, Richard Henderson wrote: > See whether or not disabling glibc's inline expansion of > memset affects 2.95 vs 3.x with -D__NO_STRING_INLINES. I can get gcc 2.95 to be as slow as 3.1 snapshot by using both -fno-builtin and -D__NO_STRING_INLINES but, I cant get gcc 3.1 to do the inlining for me If I dont use -fno-builtin -D__NO_STRING_INLINES, and just compile with -O3, gcc 2.95 generates this: movl $table+2,%edi cld movl $31,%ecx rep stosl stosw while 3.1 snapshot generates this: movl $table+2, (%esp) movl $0, 4(%esp) movl $126, 8(%esp) call memset Hope this helps, -- Michel "Walken" LESPINASSE Is this the best that god can do ? Then I'm not impressed. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: GCC performance regression - its memset ! 2002-04-22 17:13 ` Michel LESPINASSE @ 2002-04-22 17:39 ` Richard Henderson 2002-04-22 17:49 ` Michel LESPINASSE 2002-04-23 2:39 ` Jan Hubicka 1 sibling, 1 reply; 33+ messages in thread From: Richard Henderson @ 2002-04-22 17:39 UTC (permalink / raw) To: Michel LESPINASSE; +Cc: gcc list On Mon, Apr 22, 2002 at 05:10:45PM -0700, Michel LESPINASSE wrote: > I can get gcc 2.95 to be as slow as 3.1 snapshot by using both > -fno-builtin and -D__NO_STRING_INLINES I wasn't interested in -fno-builtin, but only differences seen when both 2.95 and 3.1 are using -D__NO_STRING_INLINES. r~ ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: GCC performance regression - its memset ! 2002-04-22 17:39 ` Richard Henderson @ 2002-04-22 17:49 ` Michel LESPINASSE 2002-04-23 5:03 ` Falk Hueffner 0 siblings, 1 reply; 33+ messages in thread From: Michel LESPINASSE @ 2002-04-22 17:49 UTC (permalink / raw) To: Richard Henderson, gcc list On Mon, Apr 22, 2002 at 05:13:35PM -0700, Richard Henderson wrote: > I wasn't interested in -fno-builtin, but only differences seen > when both 2.95 and 3.1 are using -D__NO_STRING_INLINES. If I use just -O3 -D__NO_STRING_INLINES, I get the same behaviour as with just -O3 (i.e. 2.95 inlines the memset and 3.1 doesnt) -- Michel "Walken" LESPINASSE Is this the best that god can do ? Then I'm not impressed. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: GCC performance regression - its memset ! 2002-04-22 17:49 ` Michel LESPINASSE @ 2002-04-23 5:03 ` Falk Hueffner 2002-04-23 6:53 ` Andreas Schwab 0 siblings, 1 reply; 33+ messages in thread From: Falk Hueffner @ 2002-04-23 5:03 UTC (permalink / raw) To: gcc list Hi, while we're at it, when you have a prototype for memset, gcc forgets about the alignment: void f1(unsigned long *p) { memset(p, 0, 16); } #include <string.h> void f2(unsigned long *p) { memset(p, 0, 16); } -> (Alpha) 0000000000000000 <f1>: 0: 08 00 f0 b7 stq zero,8(a0) 4: 1f 04 ff 5f fnop 8: 00 00 f0 b7 stq zero,0(a0) c: 01 80 fa 6b ret 0000000000000020 <f2>: 20: 0f 00 30 2c ldq_u t0,15(a0) 24: 00 00 50 2c ldq_u t1,0(a0) 28: 41 0e 30 48 mskqh t0,a0,t0 2c: 0f 00 30 3c stq_u t0,15(a0) 30: 42 06 50 48 mskql t1,a0,t1 34: 08 00 f0 3f stq_u zero,8(a0) 38: 00 00 50 3c stq_u t1,0(a0) 3c: 01 80 fa 6b ret Is there anything that can be done about that? Falk ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: GCC performance regression - its memset ! 2002-04-23 5:03 ` Falk Hueffner @ 2002-04-23 6:53 ` Andreas Schwab 0 siblings, 0 replies; 33+ messages in thread From: Andreas Schwab @ 2002-04-23 6:53 UTC (permalink / raw) To: Falk Hueffner; +Cc: gcc list Falk Hueffner <falk.hueffner@student.uni-tuebingen.de> writes: |> Hi, |> |> while we're at it, when you have a prototype for memset, gcc forgets |> about the alignment: |> |> void f1(unsigned long *p) { memset(p, 0, 16); } |> #include <string.h> |> void f2(unsigned long *p) { memset(p, 0, 16); } |> |> -> (Alpha) |> |> 0000000000000000 <f1>: |> 0: 08 00 f0 b7 stq zero,8(a0) |> 4: 1f 04 ff 5f fnop |> 8: 00 00 f0 b7 stq zero,0(a0) |> c: 01 80 fa 6b ret |> 0000000000000020 <f2>: |> 20: 0f 00 30 2c ldq_u t0,15(a0) |> 24: 00 00 50 2c ldq_u t1,0(a0) |> 28: 41 0e 30 48 mskqh t0,a0,t0 |> 2c: 0f 00 30 3c stq_u t0,15(a0) |> 30: 42 06 50 48 mskql t1,a0,t1 |> 34: 08 00 f0 3f stq_u zero,8(a0) |> 38: 00 00 50 3c stq_u t1,0(a0) |> 3c: 01 80 fa 6b ret |> |> Is there anything that can be done about that? Please try -D__NO_STRING_INLINES. Andreas. -- Andreas Schwab, SuSE Labs, schwab@suse.de SuSE GmbH, Deutschherrnstr. 15-19, D-90429 Nürnberg Key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5 "And now for something completely different." ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: GCC performance regression - its memset ! 2002-04-22 17:13 ` Michel LESPINASSE 2002-04-22 17:39 ` Richard Henderson @ 2002-04-23 2:39 ` Jan Hubicka 2002-04-23 13:36 ` Michel LESPINASSE 1 sibling, 1 reply; 33+ messages in thread From: Jan Hubicka @ 2002-04-23 2:39 UTC (permalink / raw) To: Michel LESPINASSE; +Cc: Richard Henderson, gcc list > On Mon, Apr 22, 2002 at 04:59:53PM -0700, Richard Henderson wrote: > > See whether or not disabling glibc's inline expansion of > > memset affects 2.95 vs 3.x with -D__NO_STRING_INLINES. > > I can get gcc 2.95 to be as slow as 3.1 snapshot by using both > -fno-builtin and -D__NO_STRING_INLINES > > but, I cant get gcc 3.1 to do the inlining for me > > If I dont use -fno-builtin -D__NO_STRING_INLINES, and just compile with -O3, > gcc 2.95 generates this: > > movl $table+2,%edi > cld > movl $31,%ecx > rep > stosl > stosw > > while 3.1 snapshot generates this: > > movl $table+2, (%esp) > movl $0, 4(%esp) > movl $126, 8(%esp) > call memset I guess the inlining threshold is too low or the default memset implementation too lame. I was tunning it for Athlon, so the mileage may warry from CPU to CPU. I will investigate the misscompilation first and check this second. Honza > > Hope this helps, > > -- > Michel "Walken" LESPINASSE > Is this the best that god can do ? Then I'm not impressed. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: GCC performance regression - its memset ! 2002-04-23 2:39 ` Jan Hubicka @ 2002-04-23 13:36 ` Michel LESPINASSE 2002-04-24 0:30 ` Jan Hubicka 2002-04-24 3:32 ` Jan Hubicka 0 siblings, 2 replies; 33+ messages in thread From: Michel LESPINASSE @ 2002-04-23 13:36 UTC (permalink / raw) To: Jan Hubicka; +Cc: Richard Henderson, gcc list On Tue, Apr 23, 2002 at 11:25:40AM +0200, Jan Hubicka wrote: > I guess the inlining threshold is too low or the default memset > implementation too lame. I was tunning it for Athlon, so the > mileage may warry from CPU to CPU. I will investigate the > misscompilation first and check this second. > Concerning the inlining, gcc inlines all memcpys with size smaller > than 64 bytes. Perhaps this should be extended to 128 bytes in case > we are still about 2 times as bad. This is partly due to lame > implementation of memset in glibc too :( When gcc does the inlining, performance seems to not be so bad. There is probably still some untapped performance though, as some of the initial and final alignment checks could be ommited when gcc already knows about the alignment of the memory zone (like in my test case, it was an array of shorts in the data segment, so it was known to be on a two-byte boundary at least). But might be hard to code into gcc, I dont know. Also as I've been only giving bad news up to now, I wanted to say that now that I've worked around the two issues I had with inlining and with memset, the 3.1 snapshot does provide superior performance on my libmpeg2 codebase, about 5% faster than 2.95.4, and that gets up to 8% when using -fbranch-probabilities and 9% when using -mcpu=athlon-tbird instead of the more generic -mcpu=pentiumpro. Nice work guys ! I am still worried though, that other people will have the same trouble with inlining as I did and not see all of the performance improvements as a result. Cheers, -- Michel "Walken" LESPINASSE Is this the best that god can do ? Then I'm not impressed. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: GCC performance regression - its memset ! 2002-04-23 13:36 ` Michel LESPINASSE @ 2002-04-24 0:30 ` Jan Hubicka 2002-04-24 0:50 ` Jakub Jelinek 2002-04-24 3:32 ` Jan Hubicka 1 sibling, 1 reply; 33+ messages in thread From: Jan Hubicka @ 2002-04-24 0:30 UTC (permalink / raw) To: Michel LESPINASSE; +Cc: Jan Hubicka, Richard Henderson, gcc list > On Tue, Apr 23, 2002 at 11:25:40AM +0200, Jan Hubicka wrote: > > I guess the inlining threshold is too low or the default memset > > implementation too lame. I was tunning it for Athlon, so the > > mileage may warry from CPU to CPU. I will investigate the > > misscompilation first and check this second. > > > Concerning the inlining, gcc inlines all memcpys with size smaller > > than 64 bytes. Perhaps this should be extended to 128 bytes in case > > we are still about 2 times as bad. This is partly due to lame > > implementation of memset in glibc too :( > > When gcc does the inlining, performance seems to not be so bad. There > is probably still some untapped performance though, as some of the > initial and final alignment checks could be ommited when gcc already > knows about the alignment of the memory zone (like in my test case, it When it knows, it should avoid it. Definitly on array of shorts, the alignment to even byte is not done. It is dificult to make it expect that array of shorts is 4 byte aligned, as ABI does not specify this, so it may not be. GCC has new alignment tracking code, so it should be better than any previous version, but still not that good. (for instnace when array is static, it definitly do have chance to conclude so, but it does not, however majority of string functions come to computed addresses) > was an array of shorts in the data segment, so it was known to be on a > two-byte boundary at least). But might be hard to code into gcc, I > dont know. > > Also as I've been only giving bad news up to now, I wanted to say that > now that I've worked around the two issues I had with inlining and > with memset, the 3.1 snapshot does provide superior performance on my > libmpeg2 codebase, about 5% faster than 2.95.4, and that gets up to 8% > when using -fbranch-probabilities and 9% when using -mcpu=athlon-tbird That sounds good :) > instead of the more generic -mcpu=pentiumpro. Nice work guys ! I am > still worried though, that other people will have the same trouble > with inlining as I did and not see all of the performance improvements > as a result. I will send patch to increase the constant to 128. I was re-benchmarking the code and on P4/Athlon and my assembly memset, the 64 is just on the border (ie inlined/not inlined sollution have less than 10% difference), setting it to 128 does not make us to loose something. For glibc implementation 128 is still a win to be inlined :( Honza > > Cheers, > > -- > Michel "Walken" LESPINASSE > Is this the best that god can do ? Then I'm not impressed. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: GCC performance regression - its memset ! 2002-04-24 0:30 ` Jan Hubicka @ 2002-04-24 0:50 ` Jakub Jelinek 2002-04-24 1:00 ` Jan Hubicka 0 siblings, 1 reply; 33+ messages in thread From: Jakub Jelinek @ 2002-04-24 0:50 UTC (permalink / raw) To: Jan Hubicka; +Cc: Michel LESPINASSE, Richard Henderson, gcc list On Wed, Apr 24, 2002 at 09:17:48AM +0200, Jan Hubicka wrote: > > instead of the more generic -mcpu=pentiumpro. Nice work guys ! I am > > still worried though, that other people will have the same trouble > > with inlining as I did and not see all of the performance improvements > > as a result. > > I will send patch to increase the constant to 128. I was re-benchmarking > the code and on P4/Athlon and my assembly memset, the 64 is just on the border > (ie inlined/not inlined sollution have less than 10% difference), setting > it to 128 does not make us to loose something. For glibc implementation > 128 is still a win to be inlined :( So please contribute it to glibc then... Is yours an Athlon optimized memset or does it perform better on P3/P4 too? Jakub ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: GCC performance regression - its memset ! 2002-04-24 0:50 ` Jakub Jelinek @ 2002-04-24 1:00 ` Jan Hubicka 0 siblings, 0 replies; 33+ messages in thread From: Jan Hubicka @ 2002-04-24 1:00 UTC (permalink / raw) To: Jakub Jelinek; +Cc: Jan Hubicka, Michel LESPINASSE, Richard Henderson, gcc list > On Wed, Apr 24, 2002 at 09:17:48AM +0200, Jan Hubicka wrote: > > > instead of the more generic -mcpu=pentiumpro. Nice work guys ! I am > > > still worried though, that other people will have the same trouble > > > with inlining as I did and not see all of the performance improvements > > > as a result. > > > > I will send patch to increase the constant to 128. I was re-benchmarking > > the code and on P4/Athlon and my assembly memset, the 64 is just on the border > > (ie inlined/not inlined sollution have less than 10% difference), setting > > it to 128 does not make us to loose something. For glibc implementation > > 128 is still a win to be inlined :( > > So please contribute it to glibc then... > Is yours an Athlon optimized memset or does it perform better on P3/P4 too? It is fairly simple memset (for small cases only what matter is whether memset do have fast path trought), but I do have Athlon optimized as well. The problem with glibc is that it needs machinery for Athlon specific stuff and AFAIK glibc maintainers didn't decided how to do that. Honza > > Jakub ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: GCC performance regression - its memset ! 2002-04-23 13:36 ` Michel LESPINASSE 2002-04-24 0:30 ` Jan Hubicka @ 2002-04-24 3:32 ` Jan Hubicka 1 sibling, 0 replies; 33+ messages in thread From: Jan Hubicka @ 2002-04-24 3:32 UTC (permalink / raw) To: Michel LESPINASSE; +Cc: Jan Hubicka, Richard Henderson, gcc list > Also as I've been only giving bad news up to now, I wanted to say that > now that I've worked around the two issues I had with inlining and > with memset, the 3.1 snapshot does provide superior performance on my > libmpeg2 codebase, about 5% faster than 2.95.4, and that gets up to 8% > when using -fbranch-probabilities and 9% when using -mcpu=athlon-tbird > instead of the more generic -mcpu=pentiumpro. Nice work guys ! I am Just note that -mcpu=pentiumpro is not generic at all. It optimizes for pentiumpro and (in turn P2/P3), but nothing else. Perhaps we can have switch to generate code working well on the modern CPUs, but we don't Honza > still worried though, that other people will have the same trouble > with inlining as I did and not see all of the performance improvements > as a result. > > Cheers, > > -- > Michel "Walken" LESPINASSE > Is this the best that god can do ? Then I'm not impressed. ^ permalink raw reply [flat|nested] 33+ messages in thread
end of thread, other threads:[~2002-05-21 17:21 UTC | newest] Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2002-04-22 23:07 GCC performance regression - its memset! Roger Sayle 2002-04-22 23:30 ` Michel LESPINASSE 2002-04-23 0:45 ` Michel LESPINASSE 2002-04-23 2:53 ` Jan Hubicka 2002-04-23 3:28 ` Richard Henderson 2002-05-20 8:06 ` [3.1.1] " Jan Hubicka 2002-05-20 9:36 ` Glen Nakamura 2002-05-20 10:08 ` Richard Henderson 2002-05-20 10:38 ` Jakub Jelinek 2002-05-20 11:28 ` Roger Sayle 2002-05-20 12:57 ` Glen Nakamura 2002-05-20 16:58 ` Roger Sayle 2002-05-21 8:23 ` Jack Lloyd 2002-05-21 9:55 ` Glen Nakamura 2002-05-21 11:13 ` Jack Lloyd 2002-05-21 8:04 ` Jan Hubicka 2002-05-20 12:07 ` Mark Mitchell -- strict thread matches above, loose matches on Subject: below -- 2002-04-20 18:13 GCC performance regression - up to 20% ? Michel LESPINASSE 2002-04-22 14:33 ` GCC performance regression - its memset ! Michel LESPINASSE 2002-04-22 14:58 ` Jason R Thorpe 2002-04-22 15:27 ` Michel LESPINASSE 2002-04-22 16:59 ` Segher Boessenkool 2002-04-22 17:10 ` Richard Henderson 2002-04-22 17:13 ` Michel LESPINASSE 2002-04-22 17:39 ` Richard Henderson 2002-04-22 17:49 ` Michel LESPINASSE 2002-04-23 5:03 ` Falk Hueffner 2002-04-23 6:53 ` Andreas Schwab 2002-04-23 2:39 ` Jan Hubicka 2002-04-23 13:36 ` Michel LESPINASSE 2002-04-24 0:30 ` Jan Hubicka 2002-04-24 0:50 ` Jakub Jelinek 2002-04-24 1:00 ` Jan Hubicka 2002-04-24 3:32 ` Jan Hubicka
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).