Is cross-section inlining valid behaviour?

public inbox for gcc@gcc.gnu.org

* Is cross-section inlining valid behaviour?
@ 2008-07-23 14:59 Bingfeng Mei
  2008-07-23 15:07 ` Richard Guenther
                   ` (2 more replies)
  0 siblings, 3 replies; 50+ messages in thread
From: Bingfeng Mei @ 2008-07-23 14:59 UTC (permalink / raw)
  To: gcc

Hello, 

I came across a problem related to cross-section inlining. For the
following example, 

static void foo(void) __attribute__((section ("foo")));
 
static void foo(void)
{
  printf("Hello\n");
}
 
void bar(void) __attribute__((section ("bar")));
 
void bar(void)
{
  foo();
}

 
I compiled with the latest mainline gcc. 
gcc tst.c -O3 -S


The foo function is inlined into bar anyway, even though they have different
section attributes.  Is this a bug or expected behaviour? 
	.file	"tst.c"
	.section	.rodata.str1.1,"aMS",@progbits,1
.LC0:
	.string	"Hello"
	.section	bar,"ax",@progbits
	.p2align 4,,15
.globl bar
	.type	bar, @function
bar:
.LFB3:
	movl	$.LC0, %edi
	jmp	puts
.LFE3:


Thanks. 
Bingfeng Mei

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: Is cross-section inlining valid behaviour?
  2008-07-23 14:59 Is cross-section inlining valid behaviour? Bingfeng Mei
@ 2008-07-23 15:07 ` Richard Guenther
  2008-07-23 15:14 ` Dave Korn
  2008-07-23 17:25 ` gcc will become the best optimizing x86 compiler Agner Fog
  2 siblings, 0 replies; 50+ messages in thread
From: Richard Guenther @ 2008-07-23 15:07 UTC (permalink / raw)
  To: Bingfeng Mei; +Cc: gcc

On Wed, Jul 23, 2008 at 4:46 PM, Bingfeng Mei <bmei@broadcom.com> wrote:
> Hello,
>
> I came across a problem related to cross-section inlining. For the
> following example,
>
> static void foo(void) __attribute__((section ("foo")));
>
> static void foo(void)
> {
>  printf("Hello\n");
> }
>
> void bar(void) __attribute__((section ("bar")));
>
> void bar(void)
> {
>  foo();
> }
>
>
> I compiled with the latest mainline gcc.
> gcc tst.c -O3 -S
>
>
> The foo function is inlined into bar anyway, even though they have different
> section attributes.  Is this a bug or expected behaviour?

This is expected behavior.

Richard.

>        .file   "tst.c"
>        .section        .rodata.str1.1,"aMS",@progbits,1
> .LC0:
>        .string "Hello"
>        .section        bar,"ax",@progbits
>        .p2align 4,,15
> .globl bar
>        .type   bar, @function
> bar:
> .LFB3:
>        movl    $.LC0, %edi
>        jmp     puts
> .LFE3:
>
>
> Thanks.
> Bingfeng Mei
>
>


* RE: Is cross-section inlining valid behaviour?
  2008-07-23 14:59 Is cross-section inlining valid behaviour? Bingfeng Mei
  2008-07-23 15:07 ` Richard Guenther
@ 2008-07-23 15:14 ` Dave Korn
  2008-07-23 15:31   ` Bingfeng Mei
  2008-07-23 17:25 ` gcc will become the best optimizing x86 compiler Agner Fog
  2 siblings, 1 reply; 50+ messages in thread
From: Dave Korn @ 2008-07-23 15:14 UTC (permalink / raw)
  To: 'Bingfeng Mei', gcc

Bingfeng Mei wrote on 23 July 2008 15:46:

> The foo function is inlined into bar anyway, even though they have different
> section attributes.  Is this a bug or expected behaviour?

  Well, I would expect it, but only in the light of knowing how the compiler
works.

  Sections are outside the scope of the C standard, so there is nothing
defined for how they would interact with inlining, but what I'd expect to
happen is that the section attribute would apply to any out-of-line copy of
the function body emitted, and would not apply to the body of the function
where it's inlined into another, because that can't even make any sense.

  You could mark it __attribute__ ((__noinline__)) to prevent it getting
inlined at all, but I don't suppose there's any way of saying "Only inline
into other functions also in the same section".  That might well be a useful
new attribute to invent.

    cheers,
      DaveK
-- 
Can't think of a witty .sigline today....


* RE: Is cross-section inlining valid behaviour?
  2008-07-23 15:14 ` Dave Korn
@ 2008-07-23 15:31   ` Bingfeng Mei
  0 siblings, 0 replies; 50+ messages in thread
From: Bingfeng Mei @ 2008-07-23 15:31 UTC (permalink / raw)
  To: Dave Korn, gcc

Thanks. I know how to use "noinline" to avoid inlining. It is just that our
application programmers expect different section attributes to guarantee
that the functions are never compiled into the same section, and therefore
never inlined into each other. It took us a while to track this problem down. 

Cheers,
Bingfeng

> -----Original Message-----
> From: Dave Korn [mailto:dave.korn@artimi.com] 
> Sent: 23 July 2008 16:05
> To: Bingfeng Mei; gcc@gcc.gnu.org
> Subject: RE: Is cross-section inlining valid behaviour?
> 
> Bingfeng Mei wrote on 23 July 2008 15:46:
> 
> 
> > The foo function is inlined into bar anyway, even though they have different
> > section attributes.  Is this a bug or expected behaviour?
> 
>   Well, I would expect it, but only in the light of knowing how the compiler
> works.
> 
>   Sections are outside the scope of the C standard, so there is nothing
> defined for how they would interact with inlining, but what I'd expect to
> happen is that the section attribute would apply to any out-of-line copy of
> the function body emitted, and would not apply to the body of the function
> where it's inlined into another, because that can't even make any sense.
> 
>   You could mark it __attribute__ ((__noinline__)) to prevent it getting
> inlined at all, but I don't suppose there's any way of saying "Only inline
> into other functions also in the same section".  That might well be a useful
> new attribute to invent.
> 
>     cheers,
>       DaveK
> -- 
> Can't think of a witty .sigline today....
> 
> 
> 


* gcc will become the best optimizing x86 compiler
  2008-07-23 14:59 Is cross-section inlining valid behaviour? Bingfeng Mei
  2008-07-23 15:07 ` Richard Guenther
  2008-07-23 15:14 ` Dave Korn
@ 2008-07-23 17:25 ` Agner Fog
  2008-07-23 17:33   ` Tim Prince
                     ` (2 more replies)
  2 siblings, 3 replies; 50+ messages in thread
From: Agner Fog @ 2008-07-23 17:25 UTC (permalink / raw)
  To: gcc

Hi, I am doing research on optimization of microprocessors and 
compilers. Some of you already know my optimization manuals 
(www.agner.org/optimize/).

I have tested many different compilers and compared how well they 
optimize C++ code. I have been pleased to observe that gcc has been 
improved a lot in the last couple of years. The gcc compiler itself is 
now matching the optimizing performance of the Intel compiler and it 
beats all other compilers I have tested. All you hard-working developers 
deserve credit for this!

I can imagine that gcc might be the compiler of choice for all x86 and 
x86-64 platforms in the future. Actually, the compiler itself is very 
close to being the best, but it appears that the function libraries are 
lagging behind. I have tested a few of the most important functions in 
libc and compared them with other available libraries (MS, Borland, 
Intel, Mac). The comparison does not look good for gnu libc. See my test 
results in http://www.agner.org/optimize/optimizing_cpp.pdf section 2.6. 
The 64-bit version is better than the 32-bit version, though.

The first thing that you can do to improve the performance is to drop 
the builtin versions of memory and string functions. The speed can be 
improved by up to a factor of 5 in some cases by compiling with 
-fno-builtin. The builtin version is never optimal, except for memcpy in 
cases where the count is a small compile-time constant so that it can be 
replaced by simple mov instructions.
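
As an illustration of the one case described above as favourable to the
builtin, here is a small sketch in which the memcpy count is a small
compile-time constant. The helper names load_u32 and store_u32 are made up for
this example. Whether the compiler really expands these calls into plain mov
instructions depends on the compiler version, target and flags; building with
-fno-builtin would be expected to leave them as library calls.

```c
#include <stdint.h>
#include <string.h>

/* The count is sizeof v, a small compile-time constant, so an optimizing
   compiler is free to expand the copy inline instead of calling memcpy. */
static uint32_t load_u32(const unsigned char *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Same idea in the other direction: store 4 bytes at any alignment. */
static void store_u32(unsigned char *p, uint32_t v)
{
    memcpy(p, &v, sizeof v);
}
```

Comparing the output of gcc -O2 -S with and without -fno-builtin on these two
functions shows the difference directly.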

Next, the function libraries should have CPU-dispatching and use the 
latest instruction sets where appropriate. You are not even using XMM 
registers for memcpy in 64-bit libc.

I think you can borrow code from the Mac/Darwin/Xnu project. They have 
optimized these functions very carefully for the Intel Core and Core 2 
processors. Of course they have the advantage that they don't need to 
support any other processors, whereas gcc has to support every possible 
Intel and AMD processor. This means more CPU-dispatching.
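
The CPU-dispatching idea can be sketched as follows. This is an illustration
under stated assumptions, not code from glibc, asmlib or Darwin: copy_generic,
copy_sse2, select_copy and dispatched_copy are hypothetical names, and the
"SSE2" variant is only a placeholder that forwards to memcpy. The
__builtin_cpu_init and __builtin_cpu_supports builtins used for detection were
added to GCC well after this thread was written; on other targets the sketch
falls back to the generic path.

```c
#include <stddef.h>
#include <string.h>

typedef void *(*copy_fn)(void *dst, const void *src, size_t n);

/* Hypothetical stand-in: plain byte loop that works on every CPU. */
static void *copy_generic(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n--)
        *d++ = *s++;
    return dst;
}

/* Hypothetical stand-in: a real library would put XMM-based code here. */
static void *copy_sse2(void *dst, const void *src, size_t n)
{
    return memcpy(dst, src, n);
}

/* Choose an implementation from the features of the running CPU. */
static copy_fn select_copy(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("sse2"))
        return copy_sse2;
#endif
    return copy_generic;
}

static copy_fn active_copy;

void *dispatched_copy(void *dst, const void *src, size_t n)
{
    if (!active_copy)
        active_copy = select_copy();   /* CPU check happens only once */
    return active_copy(dst, src, n);
}
```

The lazy initialisation here is not thread-safe; a production library would
more likely resolve the pointer at load time.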

I have made a few optimized functions myself and published them as a 
multi-platform library (www.agner.org/optimize/asmlib.zip). It is faster 
than most other libraries on an Intel Core2 and up to ten times faster 
than gcc using builtin functions. My library is published with GPL 
license, but I will allow you to use my code in gnu libc if you wish 
(Sorry, I don't have the time to work on the gnu project myself, but you 
may contact me for details about the code).

The Windows version of gcc is not up to date, but I think that when gcc 
gets a reputation as the best compiler, more people will be motivated to 
update cygwin/mingw. A lot of people are actually using it.


* Re: gcc will become the best optimizing x86 compiler
  2008-07-23 17:25 ` gcc will become the best optimizing x86 compiler Agner Fog
@ 2008-07-23 17:33   ` Tim Prince
  2008-07-24  8:04   ` Dennis Clarke
  2008-07-24 10:09   ` Zoltán Kócsi
  2 siblings, 0 replies; 50+ messages in thread
From: Tim Prince @ 2008-07-23 17:33 UTC (permalink / raw)
  To: Agner Fog; +Cc: gcc

Agner Fog wrote:
>  I have tested a few of the most important functions in 
> libc and compared them with other available libraries (MS, Borland, 
> Intel, Mac). The comparison does not look good for gnu libc. See my test 
> results in http://www.agner.org/optimize/optimizing_cpp.pdf section 2.6. 
As far as I can see, you identify the library you tested only as "ubuntu 
g++ 4.2.3."  Presumably, that implies some version of glibc?  On my x86-64 
system where I have glibc-2.6.1-18.3, some of the functions perform much 
better than those provided with earlier glibc versions.
Speaking of the one case where I have looked into it, the builtin_memcpy 
of gcc for 32-bit linux uses a string move which performs well only for 
certain cases of short non-aligned strings.  The corresponding 64-bit 
linux will see vastly different levels of performance, depending on the 
glibc version, as it doesn't use a builtin string move.
Certain newer CPUs aim to improve performance of the 32-bit gcc builtin 
string moves, but don't entirely eliminate the situations where it isn't 
optimum.
The machinery for getting good performing versions in glibc isn't visible 
on this list.


* Re: gcc will become the best optimizing x86 compiler
  2008-07-23 17:25 ` gcc will become the best optimizing x86 compiler Agner Fog
  2008-07-23 17:33   ` Tim Prince
@ 2008-07-24  8:04   ` Dennis Clarke
  2008-07-24  9:41     ` Agner Fog
  2008-07-24 10:09   ` Zoltán Kócsi
  2 siblings, 1 reply; 50+ messages in thread
From: Dennis Clarke @ 2008-07-24  8:04 UTC (permalink / raw)
  To: Agner Fog; +Cc: gcc

On Wed, Jul 23, 2008 at 12:15 PM, Agner Fog <agner@agner.org> wrote:
> Hi, I am doing research on optimization of microprocessors and compilers.
> Some of you already know my optimization manuals (www.agner.org/optimize/).

Sorry but I'm not buying.

The Sun Studio 12 compiler with Solaris 10 on AMD Opteron or
UltraSparc beats GCC in almost every single test case that I have
seen.  On the same hardware. Regardless of a single threaded test case
or a multi-threaded test case. The differences do occur with file IO
and with situations where peripherals get involved but for pure number
crunching and pushing data around in heaps of memory I simply have not
seen GCC ever do as well as Sun Studio 12 or even Sun Studio 10.

Also, you have provided no data at all.  So your assertions are those
of a marketing person at the moment.  Please post some code that can
be compiled and then tested with high resolution timers and perhaps we
can compare notes.

Dennis Clarke


* Re: gcc will become the best optimizing x86 compiler
  2008-07-24  8:04   ` Dennis Clarke
@ 2008-07-24  9:41     ` Agner Fog
  2008-07-24 10:10       ` Dave Korn
  2008-07-24 17:21       ` Raksit Ashok
  0 siblings, 2 replies; 50+ messages in thread
From: Agner Fog @ 2008-07-24  9:41 UTC (permalink / raw)
  To: dclarke; +Cc: gcc, TimothyPrince

Dennis Clarke wrote:
 >The Sun Studio 12 compiler with Solaris 10 on AMD Opteron or
 >UltraSparc beats GCC in almost every single test case that I have
 >seen.

This is memcpy on Solaris:
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/lib/libc/i386/gen/memcpy.s

It uses exactly the same method as memcpy on gcc libc, with only minor 
differences that have no influence on performance.

> Also, you have provided no data at all.  

I have linked to the data rather than copying it here to save space on 
the mailing list. Here is the link again:
http://www.agner.org/optimize/optimizing_cpp.pdf  section 2.6, page 12.

> So your assertions are those of a marketing person at the moment.

Who sounds like a marketing person, you or me? :-)

 > Please post some code that can be compiled and then tested with high
 > resolution timers and perhaps we can compare notes.

Here is my code, again:
http://www.agner.org/optimize/asmlib.zip
My test results, referred to above, use the "core clock cycles" 
performance counter on Intel and RDTSC on AMD. It's the highest 
resolution you can get. Feel free to do your own tests; it's as simple as 
linking my library into your test program.
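
For readers who want to try a measurement of this kind, here is a rough
sketch of a tick-based timing helper. It is not the harness used for the
results linked above: read_ticks and min_ticks_memcpy are hypothetical names,
the x86 path reads the time stamp counter through the __rdtsc intrinsic
(which is not the same thing as the "core clock cycles" performance counter),
and other targets fall back to clock(). No serialising instructions are used,
so the numbers are only indicative.

```c
#include <stdint.h>
#include <string.h>
#include <time.h>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <x86intrin.h>
/* Read the x86 time stamp counter. */
static uint64_t read_ticks(void) { return __rdtsc(); }
#else
/* Portable fallback with much coarser resolution. */
static uint64_t read_ticks(void) { return (uint64_t)clock(); }
#endif

/* Time `reps` copies of n bytes and return the smallest tick count seen. */
static uint64_t min_ticks_memcpy(void *dst, const void *src, size_t n, int reps)
{
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < reps; i++) {
        uint64_t t0 = read_ticks();
        memcpy(dst, src, n);
        uint64_t t1 = read_ticks();
        if (t1 - t0 < best)
            best = t1 - t0;
    }
    return best;
}
```

Taking the minimum over several repetitions reduces the influence of cache
misses and interrupts on the first few runs.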

Tim Prince wrote:
 >you identify the library you tested only as "ubuntu g++ 4.2.3."
Where can I see the libc version?

 >The corresponding 64-bit linux will see vastly different levels of
 >performance, depending on the glibc version, as it doesn't use a builtin
 >string move.
Yes, this is exactly what my tests show. 64-bit libc is better than 
32-bit libc, but still 3-4 times slower than the best library for 
unaligned operands on an Intel.

 >Certain newer CPUs aim to improve performance of the 32-bit gcc builtin
 >string moves, but don't entirely eliminate the situations where it isn't
 >optimum.

The Intel manuals are not clear about this. Intel Optimization reference 
manual says:
 >In most cases, applications should take advantage of the default memory
 >routines provided by Intel compilers.
What excellent advice - the Intel compiler puts in a library with an 
automatic run-slowly-on-AMD feature!
The Intel library does not use rep movs when running on an Intel CPU.

The AMD software optimization guide mentions specific situations where 
rep movs is optimal. However, my tests on an Opteron (K8) show that rep 
movs is never optimal on AMD either. I have no access to test it on the 
new AMD K10, but I expect the XMM register code to run much faster on 
K10 than on K8 because K10 has 128-bit data paths where K8 has only 64-bit.

Evidently, the problem with memcpy has been ignored for years, see 
http://softwarecommunity.intel.com/Wiki/Linux/719.htm


* Re: gcc will become the best optimizing x86 compiler
  2008-07-23 17:25 ` gcc will become the best optimizing x86 compiler Agner Fog
  2008-07-23 17:33   ` Tim Prince
  2008-07-24  8:04   ` Dennis Clarke
@ 2008-07-24 10:09   ` Zoltán Kócsi
  2 siblings, 0 replies; 50+ messages in thread
From: Zoltán Kócsi @ 2008-07-24 10:09 UTC (permalink / raw)
  To: gcc

> [...]
> I have made a few optimized functions myself and published them as a 
> multi-platform library (www.agner.org/optimize/asmlib.zip). It is
> faster than most other libraries on an Intel Core2 and up to ten
> times faster than gcc using builtin functions. My library is
> published with GPL license, but I will allow you to use my code in
> gnu libc if you wish (Sorry, I don't have the time to work on the gnu
> project myself, but you may contact me for details about the code).
> [...]

But then it's not gcc that is the best optimising compiler, but it's 
the best library *hand optimised so that gcc compiles it very well*.

Here's an example:

void foo( void )
{
unsigned x;

    for ( x = 0 ; x < 200 ; x++ ) func();
}

void bar( void )
{
unsigned x;

    for ( x = 201 ; --x ; ) func();
}

foo() and bar() are completely equivalent, they call func() 200
times and that's all. Yet, if you compile them with -O3 for arm-elf
target with version 4.0.2 (yes, I know, it's an ancient version, but
still) bar() will be 6 insns long with the loop itself being 3 while
foo() compiles to 7 insns of which 4 are the loop. In fact, the compiler
is clever enough to transform bar()'s loop from

    for ( x = 201 ; --x ; ) func();
to
    x = 200; do func(); while ( --x );

internally, the latter form being shorter to evaluate and since x is
not used other than as the loop counter it doesn't matter. However, it
is not clever enough to figure out that foo()'s loop is doing exactly
what bar()'s is doing. Since x is only the loop counter, gcc could
transform foo()'s loop to bar()'s freely but it doesn't. It generates
the equivalent of this:

    x = 0; do { x += 1; func(); } while ( x != 200 );

that is not as efficient as what it generates from bar()'s code.

Of course you get surprised when you change -O3 to -Os, in which case
gcc suddenly realises that foo() can indeed be transformed to the
internal representation that it used for bar() with -O3. Thus, we have
foo() now being only 6 insns long with a 3 insn loop. Unfortunately,
bar() is not that lucky. Although its loop remains 3 insns long, the
entire function is increased by an additional instruction, for bar()
internally now looks like this:

   x = 201;
   goto label;
   do {
      func();
label:
   } while ( --x );
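
The claim that all of these loop shapes do the same work can be checked with a
small counting harness. This is a sketch: func() is replaced by a call counter,
the wrapper names are made up for the example, and the count-up do/while form
uses a bound of 200 so that it matches foo()'s source loop. It says nothing
about which form a given gcc version compiles best, only that they are
interchangeable.

```c
/* Call counter standing in for the func() used in the message. */
static unsigned calls;
static void func(void) { calls++; }

static unsigned count_up(void)        /* foo()'s source form */
{
    unsigned x;
    calls = 0;
    for (x = 0; x < 200; x++) func();
    return calls;
}

static unsigned count_down(void)      /* bar()'s source form */
{
    unsigned x;
    calls = 0;
    for (x = 201; --x; ) func();
    return calls;
}

static unsigned do_while_down(void)   /* bar() as described for -O3 */
{
    unsigned x = 200;
    calls = 0;
    do func(); while (--x);
    return calls;
}

static unsigned do_while_up(void)     /* foo() as described for -O3 */
{
    unsigned x = 0;
    calls = 0;
    do { x += 1; func(); } while (x != 200);
    return calls;
}

static unsigned jump_into_loop(void)  /* bar() as described for -Os */
{
    unsigned x = 201;
    calls = 0;
    goto label;
    do {
        func();
label:  ;                             /* a label must be followed by a statement */
    } while (--x);
    return calls;
}
```

Each function returns the number of times func() was called, so comparing the
five results is enough to confirm the equivalence.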

You can play with gcc and see which one of the equivalent C
constructs it compiles to better code with any particular -O level
(and if you have to work  with severely constrained embedded systems
you often do) but then hand-crafting your C code to fit gcc's taste is
actually not that good an idea. With the next release, when different
constructs will be recognised, you may end up with larger and/or slower
code (as it happened to me when changing 4.0.x -> 4.3.x and before when
going from 2.9.x to 3.1.x).

Gcc will be the best optimising compiler when it generates
faster/shorter code than the other compilers on the majority of
a large set of arbitrary, *not* hand-optimised sources. Preferably 
for most targets, not only for the x86, if possible :-)

Zoltan


* RE: gcc will become the best optimizing x86 compiler
  2008-07-24  9:41     ` Agner Fog
@ 2008-07-24 10:10       ` Dave Korn
  2008-07-24 13:20         ` Basile STARYNKEVITCH
  2008-07-24 17:21       ` Raksit Ashok
  1 sibling, 1 reply; 50+ messages in thread
From: Dave Korn @ 2008-07-24 10:10 UTC (permalink / raw)
  To: 'Agner Fog', dclarke; +Cc: gcc, TimothyPrince

Agner Fog wrote on 24 July 2008 09:04:

> Tim Prince wrote:
>  >you identify the library you tested only as "ubuntu g++ 4.2.3."
> Where can I see the libc version?

  Use whichever package manager ubuntu provides to check the version of the
glibc package.  Here's an example from a CentOS system (using rpm):

[dk@quattro ~]$ rpm -q glibc
glibc-2.3.4-2.36
glibc-2.3.4-2.36
[dk@quattro ~]$ 


    cheers,
      DaveK
-- 
Can't think of a witty .sigline today....


* Re: gcc will become the best optimizing x86 compiler
  2008-07-24 10:10       ` Dave Korn
@ 2008-07-24 13:20         ` Basile STARYNKEVITCH
  2008-07-24 13:31           ` Dave Korn
                             ` (2 more replies)
  0 siblings, 3 replies; 50+ messages in thread
From: Basile STARYNKEVITCH @ 2008-07-24 13:20 UTC (permalink / raw)
  Cc: 'Agner Fog', gcc, TimothyPrince

Dave Korn wrote:
> Agner Fog wrote on 24 July 2008 09:04:
> 
>> Tim Prince wrote:
>>  >you identify the library you tested only as "ubuntu g++ 4.2.3."
>> Where can I see the libc version?
> 
>   Use whichever package manager ubuntu provides to check the version of the
> glibc package.  Here's an example from a CentOS system (using rpm):


On most Linux systems, in addition to using the package manager, you can 
run the libc.so file itself: it is executable and prints version info when 
executed. On my Debian/Sid/AMD64 I'm getting

  % /lib/libc.so.6
GNU C Library stable release version 2.7, by Roland McGrath et al.
Copyright (C) 2007 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.
Compiled by GNU CC version 4.3.1 20080523 (prerelease).
Compiled on a Linux >>2.6.25-2-amd64<< system on 2008-06-02.
Available extensions:
         crypt add-on version 2.1 by Michael Glad and others
         GNU Libidn by Simon Josefsson
         Native POSIX Threads Library by Ulrich Drepper et al
         BIND-8.2.3-T5B
For bug reporting instructions, please see:
<http://www.gnu.org/software/libc/bugs.html>.


Of course, on Ubuntu & Debian, you can query the package system
% dpkg -l libc6
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Cfg-files/Unpacked/Failed-cfg/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Hold/Reinst-required/X=both-problems (Status,Err: uppercase=bad)
||/ Name           Version        Description
+++-==============-==============-============================================
ii  libc6          2.7-12         GNU C Library: Shared libraries
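
The version can also be queried from inside a program, which is handy when a
benchmark should record the library it ran against. A minimal sketch, assuming
glibc: gnu_get_libc_version() is declared in <gnu/libc-version.h>; the wrapper
name libc_version_string is made up for this example, and on other C libraries
it just returns "unknown".

```c
#include <stdio.h>   /* on glibc this also defines __GLIBC__ */

#ifdef __GLIBC__
#include <gnu/libc-version.h>
#endif

/* Returns the run-time glibc version string, or "unknown" on other libcs. */
const char *libc_version_string(void)
{
#ifdef __GLIBC__
    return gnu_get_libc_version();
#else
    return "unknown";
#endif
}
```

Printing the result with puts(libc_version_string()) at the start of a test
run is enough to tie the timings to a specific library version.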

Regarding the original thread (performance of GCC & standard functions) 
it should be stressed that gcc would probably compile them better if 
passed machine specific flags.

Finally, at the recent (July 2008) GCC summit, someone (sorry I forgot 
who, probably someone from SuSE) proposed in a BOFS to have architecture 
and machine specific hand-tuned (or even hand-written assembly) low 
level libraries for such basic things as memset etc..

Thanks for reading
-- 
Basile STARYNKEVITCH         http://starynkevitch.net/Basile/
email: basile<at>starynkevitch<dot>net mobile: +33 6 8501 2359
8, rue de la Faiencerie, 92340 Bourg La Reine, France
*** opinions {are only mines, sont seulement les miennes} ***


* RE: gcc will become the best optimizing x86 compiler
  2008-07-24 13:20         ` Basile STARYNKEVITCH
@ 2008-07-24 13:31           ` Dave Korn
  2008-07-24 13:59           ` Agner Fog
  2008-07-24 15:02           ` Joseph S. Myers
  2 siblings, 0 replies; 50+ messages in thread
From: Dave Korn @ 2008-07-24 13:31 UTC (permalink / raw)
  To: 'Basile STARYNKEVITCH'; +Cc: 'Agner Fog', gcc, TimothyPrince

Basile STARYNKEVITCH wrote on 24 July 2008 11:28:

> On most Linux systems, in addition of using the package manager, the
> libc.so file is executable, and when executed, shows info, so on my
> Debian/Sid/AMD64 I'm getting
> 
>   % /lib/libc.so.6
> GNU C Library stable release version 2.7, by Roland McGrath et al. 
[snip!]

  oooh, nice - that's loads more informative than "This program cannot be
run in DOS mode"  ;-)

  Thanks for the tip!

    cheers,
      DaveK
-- 
Can't think of a witty .sigline today....


* Re: gcc will become the best optimizing x86 compiler
  2008-07-24 13:20         ` Basile STARYNKEVITCH
  2008-07-24 13:31           ` Dave Korn
@ 2008-07-24 13:59           ` Agner Fog
  2008-07-24 14:40             ` Richard Guenther
  2008-07-28 10:57             ` Andrew Haley
  2008-07-24 15:02           ` Joseph S. Myers
  2 siblings, 2 replies; 50+ messages in thread
From: Agner Fog @ 2008-07-24 13:59 UTC (permalink / raw)
  To: Basile STARYNKEVITCH; +Cc: gcc, TimothyPrince

Basile STARYNKEVITCH wrote:
 >At last, at the recent (july 2008) GCC summit, someone (sorry I forgot who,
 >probably someone from SuSE) proposed in a BOFS to have architecture and
 >machine specific hand-tuned (or even hand-written assembly) low level
 >libraries for such basic things as memset etc..

That's exactly what I meant. The most important memory, string and math 
functions should use hand-tuned assembly with CPU dispatching for the 
latest instruction sets. My experiments show that the speed can be 
improved by a factor of 3 to 10 for unaligned memcpy on Intel processors 
(http://www.agner.org/optimize/optimizing_cpp.pdf page 12).

There will be more hand-tuning work to do when the 256-bit YMM registers 
become available in a few years - and more to gain in speed.


* Re: gcc will become the best optimizing x86 compiler
  2008-07-24 13:59           ` Agner Fog
@ 2008-07-24 14:40             ` Richard Guenther
  2008-07-28 10:57             ` Andrew Haley
  1 sibling, 0 replies; 50+ messages in thread
From: Richard Guenther @ 2008-07-24 14:40 UTC (permalink / raw)
  To: Agner Fog; +Cc: Basile STARYNKEVITCH, gcc, TimothyPrince

On Thu, Jul 24, 2008 at 3:28 PM, Agner Fog <agner@agner.org> wrote:
> Basile STARYNKEVITCH wrote:
>>At last, at the recent (july 2008) GCC summit, someone (sorry I forgot who,
>> probably someone from SuSE)

That was me and Michael Matz.

Richard.


* Re: gcc will become the best optimizing x86 compiler
  2008-07-24 13:20         ` Basile STARYNKEVITCH
  2008-07-24 13:31           ` Dave Korn
  2008-07-24 13:59           ` Agner Fog
@ 2008-07-24 15:02           ` Joseph S. Myers
  2008-07-24 16:26             ` Agner Fog
  2008-07-24 17:17             ` Basile STARYNKEVITCH
  2 siblings, 2 replies; 50+ messages in thread
From: Joseph S. Myers @ 2008-07-24 15:02 UTC (permalink / raw)
  To: Basile STARYNKEVITCH; +Cc: 'Agner Fog', gcc, TimothyPrince

On Thu, 24 Jul 2008, Basile STARYNKEVITCH wrote:

> At last, at the recent (july 2008) GCC summit, someone (sorry I forgot who,
> probably someone from SuSE) proposed in a BOFS to have architecture and
> machine specific hand-tuned (or even hand-written assembly) low level
> libraries for such basic things as memset etc..

I don't recall seeing any BOF minutes on this list yet this year.  Are 
people going to be posting them?

I don't know if it was proposed in this context, but the ARM EABI has 
various __aeabi_mem* functions for calls known to have particular 
alignment and the idea is relevant to other platforms if you provide such 
functions with the compiler.  The compiler could also generate calls to 
different functions depending on the -march options and so save the 
runtime CPU check cost (you could have options to call either generic 
versions, or versions for a particular CPU, depending on whether you are 
building a generic binary for CPU-X-or-newer or a binary just for CPU X).
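
A source-level approximation of that compile-time selection is sketched below.
It is not the mechanism proposed above, where the compiler itself would emit
calls to different library functions; it only shows that the target options
are visible at compile time through macros such as __AVX2__, __SSE2__ and
__ARM_NEON, which GCC predefines according to the selected -march/-m flags.
The function name copy_variant and the returned labels are hypothetical.

```c
/* Report which variant would be chosen at compile time from the macros
   predefined for the selected target options; no run-time CPU check. */
const char *copy_variant(void)
{
#if defined(__AVX2__)
    return "avx2";
#elif defined(__SSE2__)
    return "sse2";
#elif defined(__ARM_NEON)
    return "neon";
#else
    return "generic";
#endif
}
```

A binary built this way only runs correctly on CPUs that have the selected
features, which is the CPU-X-only versus CPU-X-or-newer trade-off described
above.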

As usual in this area, careful negotiation with the FSF at an early stage 
to be able to reuse glibc versions of the functions where useful would be 
a good idea.  Reusing the glibc testcases for string functions (that e.g. 
they don't access beyond the memory they are allowed to access at the end 
of a page) would be a good idea as well, and doesn't have the problems 
with changing licenses that reusing the functions themselves does.

-- 
Joseph S. Myers
joseph@codesourcery.com


* Re: gcc will become the best optimizing x86 compiler
  2008-07-24 15:02           ` Joseph S. Myers
@ 2008-07-24 16:26             ` Agner Fog
  2008-07-24 17:17             ` Basile STARYNKEVITCH
  1 sibling, 0 replies; 50+ messages in thread
From: Agner Fog @ 2008-07-24 16:26 UTC (permalink / raw)
  To: Joseph S. Myers; +Cc: Basile STARYNKEVITCH, gcc, TimothyPrince

Joseph S. Myers wrote:

 >I don't know if it was proposed in this context, but the ARM EABI has
 >various __aeabi_mem* functions for calls known to have particular
 >alignment and the idea is relevant to other platforms if you provide such
 >functions with the compiler. The compiler could also generate calls to
 >different functions depending on the -march options and so save the
 >runtime CPU check cost (you could have options to call either generic
 >versions, or versions for a particular CPU, depending on whether you are
 >building a generic binary for CPU-X-or-newer or a binary just for CPU X).

memcpy in the Intel and Mac libraries, as well as in my own code, has 
different branches for different alignments and different CPU 
instruction sets. The runtime cost for this branching is negligible 
compared to the gain, even when the byte count is small. No need to 
bother the programmer with different versions.
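
The alignment branching described here can be sketched in plain C. This is a
simplified illustration, not the code from the Intel, Mac or asmlib libraries:
copy_by_alignment is a hypothetical name, only one aligned fast path is shown,
and real implementations use vector registers and handle misaligned operands
far more cleverly.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

void *copy_by_alignment(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Fast branch: both pointers are 8-byte aligned, move 8 bytes at a time. */
    if ((((uintptr_t)d | (uintptr_t)s) & 7u) == 0) {
        while (n >= 8) {
            uint64_t w;
            memcpy(&w, s, 8);   /* fixed-size copies sidestep aliasing rules */
            memcpy(d, &w, 8);
            d += 8;
            s += 8;
            n -= 8;
        }
    }

    /* Slow branch and tail: whatever is left goes one byte at a time. */
    while (n--)
        *d++ = *s++;
    return dst;
}
```

The alignment test costs a couple of instructions per call, which is the
branching overhead the message argues is negligible next to the gain.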

You can just copy the code from the Mac library, or from me.


* Re: gcc will become the best optimizing x86 compiler
  2008-07-24 15:02           ` Joseph S. Myers
  2008-07-24 16:26             ` Agner Fog
@ 2008-07-24 17:17             ` Basile STARYNKEVITCH
  1 sibling, 0 replies; 50+ messages in thread
From: Basile STARYNKEVITCH @ 2008-07-24 17:17 UTC (permalink / raw)
  To: Joseph S. Myers; +Cc: 'Agner Fog', gcc, TimothyPrince

Joseph S. Myers wrote:
> On Thu, 24 Jul 2008, Basile STARYNKEVITCH wrote:
> 
>> At last, at the recent (july 2008) GCC summit, someone (sorry I forgot who,
>> probably someone from SuSE) proposed in a BOFS to have architecture and
>> machine specific hand-tuned (or even hand-written assembly) low level
>> libraries for such basic things as memset etc..
> 
> I don't recall seeing any BOF minutes on this list yet this year.  Are 
> people going to be posting them?


For the BOFS that I proposed, the summary is on the wiki.

http://gcc.gnu.org/wiki/MakingGCCEasierToLearn

(I agree that my summary is not very good. Feel free to improve it)

Regards.

-- 
Basile STARYNKEVITCH         http://starynkevitch.net/Basile/
email: basile<at>starynkevitch<dot>net mobile: +33 6 8501 2359
8, rue de la Faiencerie, 92340 Bourg La Reine, France
*** opinions {are only mines, sont seulement les miennes} ***


* Re: gcc will become the best optimizing x86 compiler
  2008-07-24  9:41     ` Agner Fog
  2008-07-24 10:10       ` Dave Korn
@ 2008-07-24 17:21       ` Raksit Ashok
  2008-07-25  7:23         ` Agner Fog
  1 sibling, 1 reply; 50+ messages in thread
From: Raksit Ashok @ 2008-07-24 17:21 UTC (permalink / raw)
  To: Agner Fog; +Cc: dclarke, gcc, TimothyPrince

On Thu, Jul 24, 2008 at 1:03 AM, Agner Fog <agner@agner.org> wrote:
> Dennis Clarke wrote:
>>The Sun Studio 12 compiler with Solaris 10 on AMD Opteron or
>>UltraSparc beats GCC in almost every single test case that I have
>>seen.
>
> This is memcpy on Solaris:
> http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/lib/libc/i386/gen/memcpy.s
>
> It uses exactly the same method as memcpy on gcc libc, with only minor
> differences that have no influence on performance.

There is a more optimized version for 64-bit:
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/lib/libc/amd64/gen/memcpy.s

I think this looks similar to your implementation, Agner.

-raksit

>
>> Also, you have provided no data at all.
>
> I have linked to the data rather than copying it here to save space on the
> mailing list. Here is the link again:
> http://www.agner.org/optimize/optimizing_cpp.pdf  section 2.6, page 12.
>
>> So your assertions are those of a marketing person at the moment.
>
> Who sounds like a marketing person, you or me? :-)
>
>> Please post some code that can be compiled and then tested with high
>> resolution timers and perhaps
>> we can compare notes.
>
> Here is my code, again:
> http://www.agner.org/optimize/asmlib.zip
> My test results, referred to above, use the "core clock cycles" performance
> counter on Intel and RDTSC on AMD. It's the highest resolution you can get.
> Feel free to do your own tests; it's as simple as linking my library into
> your test program.
>
> Tim Prince wrote:
>>you identify the library you tested only as "ubuntu g++ 4.2.3."
> Where can I see the libc version?
>
>>The corresponding 64-bit linux will see vastly different levels of
>> performance, depending on the
>>glibc version, as it doesn't use a builtin string move.
> Yes, this is exactly what my tests show. 64-bit libc is better than 32-bit
> libc, but still 3-4 times slower than the best library for unaligned
> operands on an Intel.
>
>>Certain newer CPUs aim to improve performance of the 32-bit gcc builtin
>> string moves, but don't
>> entirely eliminate the situations where it isn't optimum.
>
> The Intel manuals are not clear about this. Intel Optimization reference
> manual says:
>>In most cases, applications should take advantage of the default memory
>> routines provided by Intel compilers.
> What excellent advice - the Intel compiler puts in a library with an
> automatic run-slowly-on-AMD feature!
> The Intel library does not use rep movs when running on an Intel CPU.
>
> The AMD software optimization guide mentions specific situations where rep
> movs is optimal. However, my tests on an Opteron (K8) tell that rep movs is
> never optimal on AMD either. I have no access to test it on the new AMD K10,
> but I expect the XMM register code to run much faster on K10 than on K8
> because K10 has 128-bit data paths where K8 has only 64-bit.
>
> Evidently, the problem with memcpy has been ignored for years, see
> http://softwarecommunity.intel.com/Wiki/Linux/719.htm
>
>

* Re: gcc will become the best optimizing x86 compiler
  2008-07-24 17:21       ` Raksit Ashok
@ 2008-07-25  7:23         ` Agner Fog
  2008-07-26  0:23           ` Michael Meissner
  2008-07-30 16:37           ` Denys Vlasenko
  0 siblings, 2 replies; 50+ messages in thread
From: Agner Fog @ 2008-07-25  7:23 UTC (permalink / raw)
  To: Raksit Ashok; +Cc: dclarke, gcc, TimothyPrince

Raksit Ashok wrote:
 >There is a more optimized version for 64-bit:
 >http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/lib/libc/amd64/gen/memcpy.s
 >I think this looks similar to your implementation, Agner.

Yes it is similar to my code.

Gnu libc could borrow a lot of optimized functions from Opensolaris and 
Mac and other open source projects. They look better than Gnu libc, but 
there is still room for improvement. For example, Opensolaris does not 
use XMM registers for strlen, although this is simpler than using 
general purpose registers (see my code www.agner.org/optimize/asmlib.zip)

* Re: gcc will become the best optimizing x86 compiler
  2008-07-25  7:23         ` Agner Fog
@ 2008-07-26  0:23           ` Michael Meissner
  2008-07-26 17:49             ` Agner Fog
  2008-07-28 11:45             ` Agner Fog
  2008-07-30 16:37           ` Denys Vlasenko
  1 sibling, 2 replies; 50+ messages in thread
From: Michael Meissner @ 2008-07-26  0:23 UTC (permalink / raw)
  To: Agner Fog; +Cc: Raksit Ashok, dclarke, gcc, TimothyPrince

On Fri, Jul 25, 2008 at 09:08:42AM +0200, Agner Fog wrote:
> Raksit Ashok wrote:
> >There is a more optimized version for 64-bit:
> >http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/lib/libc/amd64/gen/memcpy.s
> >I think this looks similar to your implementation, Agner.
>
> Yes it is similar to my code.
>
> Gnu libc could borrow a lot of optimized functions from Opensolaris and  
> Mac and other open source projects. They look better than Gnu libc, but  
> there is still room for improvement. For example, Opensolaris does not  
> use XMM registers for strlen, although this is simpler than using  
> general purpose registers (see my code www.agner.org/optimize/asmlib.zip)

Note, glibc can only take code that is appropriately licensed and donated to
the FSF.  In addition it must meet the coding standards for glibc.

Also note that what is fastest for the operation depends on the basic chip
level (for example, using XMM registers is not faster for current AMD
platforms).

Memcpy/memset optimizations were added to glibc 2.8, though when your favorite
distribution will provide it is a different question:
http://sourceware.org/ml/libc-alpha/2008-04/msg00050.html

-- 
Michael Meissner
email: gnu@the-meissners.org
http://www.the-meissners.org

* Re: gcc will become the best optimizing x86 compiler
  2008-07-26  0:23           ` Michael Meissner
@ 2008-07-26 17:49             ` Agner Fog
  2008-07-28 11:45             ` Agner Fog
  1 sibling, 0 replies; 50+ messages in thread
From: Agner Fog @ 2008-07-26 17:49 UTC (permalink / raw)
  To: Michael Meissner, Agner Fog, Raksit Ashok, dclarke, gcc, TimothyPrince

Michael Meissner wrote:
> On Fri, Jul 25, 2008 at 09:08:42AM +0200, Agner Fog wrote:
>   
>> Gnu libc could borrow a lot of optimized functions from Opensolaris and  
>> Mac and other open source projects. They look better than Gnu libc, but  
>> there is still room for improvement. For example, Opensolaris does not  
>> use XMM registers for strlen, although this is simpler than using  
>> general purpose registers (see my code www.agner.org/optimize/asmlib.zip)
>>     
>
> Note, glibc can only take code that is appropriately licensed and donated to
> the FSF.  In addition it must meet the coding standards for glibc.
>   
The Mac/Xnu and Opensolaris projects have fairly liberal public 
licenses. If there are legal differences, maybe the copyright owner is 
open to negotiation. My own code has GPL license. The fact that I am 
offering my code to you also means, of course, that I am willing to 
grant the necessary license.

> Also note, that it depends on the basic chip level what is fastest for the
> operation (for example, using XMM registers are not faster for current AMD
> platforms).
>   
Indeed. That's why I am talking about CPU dispatching (i.e. different 
branches for different CPUs). The CPU dispatching can be done with just 
a single jump instruction:
At the function entry there is an indirect jump through a pointer to the 
appropriate version. The code pointer initially points to a CPU 
dispatcher. The CPU dispatcher detects which CPU it is running on, and 
replaces the code pointer with a pointer to the appropriate version, 
then jumps to the pointer. The next time the function is called, it 
follows the pointer directly to the right version.
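
The dispatch scheme described above can be sketched in C roughly as
follows. This is only an illustration under stated assumptions, not the
code in asmlib: my_memcpy, copy_generic, copy_wide and cpu_has_wide_moves
are hypothetical names, the two variants are plain stand-ins for
hand-tuned versions, and the CPU probe is hard-wired where a real one
would execute CPUID.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Two stand-in implementations. In a real library these would be
   hand-tuned variants (e.g. one using XMM registers); both are
   hypothetical and simply copy bytes here. */
static void *copy_generic(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n--)
        *d++ = *s++;
    return dst;
}

static void *copy_wide(void *dst, const void *src, size_t n)
{
    return memcpy(dst, src, n);   /* placeholder for a tuned version */
}

/* Hypothetical CPU probe. A real one would execute CPUID; it is fixed
   here so the sketch stays portable. */
static int cpu_has_wide_moves(void)
{
    return 1;
}

static void *copy_dispatch(void *dst, const void *src, size_t n);

/* The code pointer initially points at the dispatcher. */
static void *(*copy_ptr)(void *, const void *, size_t) = copy_dispatch;

/* Runs on the first call only: pick a version, replace the pointer,
   then continue into the chosen version. */
static void *copy_dispatch(void *dst, const void *src, size_t n)
{
    copy_ptr = cpu_has_wide_moves() ? copy_wide : copy_generic;
    return copy_ptr(dst, src, n);
}

/* Public entry point: one indirect jump through the pointer. */
void *my_memcpy(void *dst, const void *src, size_t n)
{
    assert(n == 0 || (dst != NULL && src != NULL));
    return copy_ptr(dst, src, n);
}
```

After the first call copy_ptr holds the chosen variant, so every later
call costs one indirect jump. A production version would also have to
think about concurrent first calls.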

My memcpy runs faster with XMM registers than with 64-bit x64 registers 
on AMD K8.
My strlen runs slower with XMM registers than with 64-bit x64 registers 
on AMD K8.

I expect the XMM versions to run much faster on AMD K10, because it has 
full 128-bit execution units and data paths, where K8 has only 64-bits. 
I have not had the chance to test this on AMD K10 yet.

I believe it is best to optimize for the newest processors, because the 
processor that is brand new today will become mainstream in a few years.
> Memcpy/memset optimizations were added to glibc 2.8, though when your favorite
> distribution will provide it is a different question:
> http://sourceware.org/ml/libc-alpha/2008-04/msg00050.html
>   
I have libc version 2.7. Can't find version 2.8.

* Re: gcc will become the best optimizing x86 compiler
  2008-07-24 13:59           ` Agner Fog
  2008-07-24 14:40             ` Richard Guenther
@ 2008-07-28 10:57             ` Andrew Haley
  1 sibling, 0 replies; 50+ messages in thread
From: Andrew Haley @ 2008-07-28 10:57 UTC (permalink / raw)
  To: Agner Fog; +Cc: Basile STARYNKEVITCH, gcc, TimothyPrince

Agner Fog wrote:
> Basile STARYNKEVITCH wrote:
>> At last, at the recent (July 2008) GCC summit, someone (sorry I forgot
>> who, probably someone from SuSE) proposed in a BOFS to have architecture
>> and machine specific hand-tuned (or even hand-written assembly) low
>> level libraries for such basic things as memset etc..
> 
> That's exactly what I meant. The most important memory, string and math
> functions should use hand-tuned assembly with CPU dispatching for the
> latest instruction sets. My experiments show that the speed can be
> improved by a factor 3 - 10 for unaligned memcpy on Intel processors
> (http://www.agner.org/optimize/optimizing_cpp.pdf page 12).

Is this still true if you have to go through the PLT to make a position-
independent call?  That's the most common case for userspace on GNU/Linux.

Andrew.

* Re: gcc will become the best optimizing x86 compiler
  2008-07-26  0:23           ` Michael Meissner
  2008-07-26 17:49             ` Agner Fog
@ 2008-07-28 11:45             ` Agner Fog
  2008-07-28 14:40               ` Daniel Jacobowitz
  2008-07-28 17:19               ` Michael Matz
  1 sibling, 2 replies; 50+ messages in thread
From: Agner Fog @ 2008-07-28 11:45 UTC (permalink / raw)
  To: Michael Meissner, Agner Fog, Raksit Ashok, dclarke, gcc,
	TimothyPrince, Tarjei Knapstad

Michael Meissner wrote:
 >Memcpy/memset optimizations were added to glibc 2.8, though when your favorite
 >distribution will provide it is a different question:
 >http://sourceware.org/ml/libc-alpha/2008-04/msg00050.html

I finally got a SUSE with glibc 2.8. I can see that 32-bit memcpy has 
been modified with an extra misalignment branch, but no significant 
improvement. Glibc 2.8 is NOT faster than glibc 2.7 in my tests. It 
still doesn't use XMM registers.

Glibc 2.8 is still almost 5 times slower than the best function 
libraries for unaligned data on Intel Core 2, and the default builtin 
function is slower than any other implementation I have seen (copies 1 
byte at a time!).
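
A comparison of this kind could be timed with something like the sketch
below. It is an assumption-laden illustration, not the harness used for
the numbers above: time_copy, BUF_SIZE and REPEATS are hypothetical,
clock() is far coarser than the performance counters mentioned earlier in
the thread, and the empty asm statement is a GCC/Clang extension used
only to keep the copies from being optimised away.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

enum { BUF_SIZE = 4096, REPEATS = 100000 };

/* Extra 64 bytes of slack so offsets below 64 stay inside the buffers. */
static unsigned char src_buf[BUF_SIZE + 64];
static unsigned char dst_buf[BUF_SIZE + 64];

/* Time REPEATS copies of BUF_SIZE bytes with the given byte offsets
   applied to source and destination; returns CPU seconds used. */
double time_copy(size_t src_off, size_t dst_off)
{
    assert(src_off < 64 && dst_off < 64);
    clock_t start = clock();
    for (int i = 0; i < REPEATS; i++) {
        memcpy(dst_buf + dst_off, src_buf + src_off, BUF_SIZE);
        /* Compiler barrier so the loop body is not removed. */
        __asm__ volatile("" : : "r"(dst_buf) : "memory");
    }
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Comparing time_copy(0, 0) with, say, time_copy(1, 3) gives a rough idea
of the penalty for misaligned operands with whatever memcpy the program
is linked against.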

Tarjei Knapstad wrote:
 >2008/7/26 Agner Fog <agner@agner.org>:
 >>I have libc version 2.7. Can't find version 2.8
 >It's in Fedora 9, I have no idea why the source isn't directly
 >available from the glibc homepage.

2.8 is not an official final release yet.

* Re: gcc will become the best optimizing x86 compiler
  2008-07-28 11:45             ` Agner Fog
@ 2008-07-28 14:40               ` Daniel Jacobowitz
  2008-07-28 17:37                 ` Dennis Clarke
  2008-07-28 17:19               ` Michael Matz
  1 sibling, 1 reply; 50+ messages in thread
From: Daniel Jacobowitz @ 2008-07-28 14:40 UTC (permalink / raw)
  To: Agner Fog
  Cc: Michael Meissner, Raksit Ashok, dclarke, gcc, TimothyPrince,
	Tarjei Knapstad

On Mon, Jul 28, 2008 at 12:56:57PM +0200, Agner Fog wrote:
> >2008/7/26 Agner Fog <agner@agner.org>:
> >>I have libc version 2.7. Can't find version 2.8
> >It's in Fedora 9, I have no idea why the source isn't directly
> >available from the glibc homepage.
>
> 2.8 is not an official final release yet.

That's incorrect; the glibc maintainers just don't care much for
tarballs.  You can find the tag in CVS from several months ago.

-- 
Daniel Jacobowitz
CodeSourcery

* Re: gcc will become the best optimizing x86 compiler
  2008-07-28 11:45             ` Agner Fog
  2008-07-28 14:40               ` Daniel Jacobowitz
@ 2008-07-28 17:19               ` Michael Matz
  2008-07-29  6:15                 ` Agner Fog
  1 sibling, 1 reply; 50+ messages in thread
From: Michael Matz @ 2008-07-28 17:19 UTC (permalink / raw)
  To: Agner Fog
  Cc: Michael Meissner, Raksit Ashok, dclarke, gcc, TimothyPrince,
	Tarjei Knapstad

Hi,

On Mon, 28 Jul 2008, Agner Fog wrote:

> Glibc 2.8 is still almost 5 times slower than the best function 
> libraries for unaligned data on Intel Core 2, and the default builtin 
> function is slower than any other implementation I have seen (copies 1 
> byte at a time!).

You must be doing something wrong.  If the compiler decides to inline the 
string ops it either knows the size or you told it to do it anyway 
(-minline-all-stringops or -minline-stringops-dynamically).  In both cases 
it will use wider-than-byte moves when possible.


Ciao,
Michael.

* Re: gcc will become the best optimizing x86 compiler
  2008-07-28 14:40               ` Daniel Jacobowitz
@ 2008-07-28 17:37                 ` Dennis Clarke
  2008-07-28 17:54                   ` Paolo Carlini
  0 siblings, 1 reply; 50+ messages in thread
From: Dennis Clarke @ 2008-07-28 17:37 UTC (permalink / raw)
  To: Agner Fog, Michael Meissner, Raksit Ashok, dclarke, gcc,
	TimothyPrince, Tarjei Knapstad

On Mon, Jul 28, 2008 at 8:10 AM, Daniel Jacobowitz <drow@false.org> wrote:
> On Mon, Jul 28, 2008 at 12:56:57PM +0200, Agner Fog wrote:
>> >2008/7/26 Agner Fog <agner@agner.org>:
>> >>I have libc version 2.7. Can't find version 2.8
>> >It's in Fedora 9, I have no idea why the source isn't directly
>> >available from the glibc homepage.
>>
>> 2.8 is not an official final release yet.
>
> That's incorrect; the glibc maintainers just don't care much for
> tarballs.  You can find the tag in CVS from several months ago.

this page :

    http://www.gnu.org/software/libc/

says :

    Current Status
        The current version is 2.7.

        See the NEWS file for more information.

        There is a FAQ which you should read first.


also, IMO, the NEWS section says nothing useful to any human.

Dennis

* Re: gcc will become the best optimizing x86 compiler
  2008-07-28 17:37                 ` Dennis Clarke
@ 2008-07-28 17:54                   ` Paolo Carlini
  2008-07-28 18:31                     ` Dennis Clarke
  0 siblings, 1 reply; 50+ messages in thread
From: Paolo Carlini @ 2008-07-28 17:54 UTC (permalink / raw)
  To: dclarke; +Cc: gcc

Dennis Clarke wrote:
> also, IMO, the NEWS sections says nothing useful to any human.
>   
but, *some* humans like to click on the first (download) link on top.

Paolo.

* Re: gcc will become the best optimizing x86 compiler
  2008-07-28 17:54                   ` Paolo Carlini
@ 2008-07-28 18:31                     ` Dennis Clarke
  2008-07-28 18:37                       ` Ian Lance Taylor
                                         ` (2 more replies)
  0 siblings, 3 replies; 50+ messages in thread
From: Dennis Clarke @ 2008-07-28 18:31 UTC (permalink / raw)
  To: Paolo Carlini; +Cc: gcc

On Mon, Jul 28, 2008 at 1:17 PM, Paolo Carlini <paolo.carlini@oracle.com> wrote:
> Dennis Clarke wrote:
>>
>> also, IMO, the NEWS sections says nothing useful to any human.
>>
>
> but, *some* humans like to click on the first (download) link on top.

where ?

It says

Availability
The releases are available at http://ftp.gnu.org/gnu/glibc/ and its mirrors.

which has glibc-2.7.tar.bz2 as the latest.

hold on .. on the NEWS page I see ... okay .. how very user friendly.
Sort of the thing one would put on the project homepage I would think.

Dennis

* Re: gcc will become the best optimizing x86 compiler
  2008-07-28 18:31                     ` Dennis Clarke
@ 2008-07-28 18:37                       ` Ian Lance Taylor
  2008-07-28 19:44                       ` Dave Korn
  2008-07-29  1:31                       ` Gerald Pfeifer
  2 siblings, 0 replies; 50+ messages in thread
From: Ian Lance Taylor @ 2008-07-28 18:37 UTC (permalink / raw)
  To: dclarke; +Cc: Paolo Carlini, gcc

"Dennis Clarke" <blastwave@gmail.com> writes:

> hold on .. on the NEWS page I see ... okay .. how very user friendly.
> Sort of the thing one would put on the project homepage I would think.

The glibc project has their own special approach to user friendliness.

Ian

* RE: gcc will become the best optimizing x86 compiler
  2008-07-28 18:31                     ` Dennis Clarke
  2008-07-28 18:37                       ` Ian Lance Taylor
@ 2008-07-28 19:44                       ` Dave Korn
  2008-07-28 21:40                         ` Dennis Clarke
  2008-07-29  1:31                       ` Gerald Pfeifer
  2 siblings, 1 reply; 50+ messages in thread
From: Dave Korn @ 2008-07-28 19:44 UTC (permalink / raw)
  To: dclarke, 'Paolo Carlini'; +Cc: gcc

Dennis Clarke wrote on 28 July 2008 18:54:

> On Mon, Jul 28, 2008 at 1:17 PM, Paolo Carlini <paolo.carlini@oracle.com>
> wrote: 
>> Dennis Clarke wrote:
>>> 
>>> also, IMO, the NEWS sections says nothing useful to any human.
>>> 
>> 
>> but, *some* humans like to click on the first (download) link on top.
> 
> where ?
> 
> It says
> 
> Availability
> The releases are available at http://ftp.gnu.org/gnu/glibc/ and its
> mirrors. 
> 
> which has glibc-2.7.tar.bz2 as the latest.
> 
> hold on .. on the NEWS page I see ... okay .. how very user friendly.
> Sort of the thing one would put on the project homepage I would think.

  It's not the NEWS page; it's a link to the source of the NEWS file stored
in the glibc CVS repository.

  The gnu.org page is rather out of date, and a bit obfuscated.

  Most GNU projects have a prominent link in their gnu.org directory page to
the actual project home page; in this case it's tucked away on the
"resources" page in the "Project website" section.  (Oh, and it still points
to "sources.redhat.com", which is a sign of just how out-of-date that
gnu.org page really is...)

  Follow that link, and you'll see the *real* project home, with the real
news and the real latest-release info, and the real list of mailing lists,
and the wiki, and ...

    cheers,
      DaveK
-- 
Can't think of a witty .sigline today....

* Re: gcc will become the best optimizing x86 compiler
  2008-07-28 19:44                       ` Dave Korn
@ 2008-07-28 21:40                         ` Dennis Clarke
  0 siblings, 0 replies; 50+ messages in thread
From: Dennis Clarke @ 2008-07-28 21:40 UTC (permalink / raw)
  To: Dave Korn; +Cc: Paolo Carlini, gcc

On Mon, Jul 28, 2008 at 2:30 PM, Dave Korn <dave.korn@artimi.com> wrote:
> Dennis Clarke wrote on 28 July 2008 18:54:
>
>> On Mon, Jul 28, 2008 at 1:17 PM, Paolo Carlini <paolo.carlini@oracle.com>
>> wrote:
>>> Dennis Clarke wrote:
>>>>
>>>> also, IMO, the NEWS sections says nothing useful to any human.
>>>>
>>>
>>> but, *some* humans like to click on the first (download) link on top.
>>
>> where ?
>>
>> It says
>>
>> Availability
>> The releases are available at http://ftp.gnu.org/gnu/glibc/ and its
>> mirrors.
>>
>> which has glibc-2.7.tar.bz2 as the latest.
>>
>> hold on .. on the NEWS page I see ... okay .. how very user friendly.
>> Sort of the thing one would put on the project homepage I would think.
>
>  It's not the NEWS page; it's a link to the source of the NEWS file stored
> in the glibc CVS repository.
>
>  The gnu.org page is rather out of date, and a bit obfuscated.
>
>  Most GNU projects have a prominent link in their gnu.org directory page to
> the actual project home page; in this case it's tucked away on the
> "resources" page in the "Project website" section.  (Oh, and it still points
> to "sources.redhat.com", which is a sign of just how out-of-date that
> gnu.org page really is...)
>
>  Follow that link, and you'll see the *real* project home, with the real
> news and the real latest-release info, and the real list of mailing lists,
> and the wiki, and ...

the *real* wiki ?  :-)

Dennis

ps: I used CVS to get the sources.

* Re: gcc will become the best optimizing x86 compiler
  2008-07-28 18:31                     ` Dennis Clarke
  2008-07-28 18:37                       ` Ian Lance Taylor
  2008-07-28 19:44                       ` Dave Korn
@ 2008-07-29  1:31                       ` Gerald Pfeifer
  2008-07-29  6:29                         ` Agner Fog
  2 siblings, 1 reply; 50+ messages in thread
From: Gerald Pfeifer @ 2008-07-29  1:31 UTC (permalink / raw)
  To: dclarke; +Cc: Paolo Carlini, gcc

On Mon, 28 Jul 2008, Dennis Clarke wrote:
> hold on .. on the NEWS page I see ... okay .. how very user friendly.
> Sort of the thing one would put on the project homepage I would think.

See how user friendly we in GCC-land are in comparison? ;-)

Gerald

* Re: gcc will become the best optimizing x86 compiler
  2008-07-28 17:19               ` Michael Matz
@ 2008-07-29  6:15                 ` Agner Fog
  2008-07-29  9:31                   ` Richard Guenther
                                     ` (2 more replies)
  0 siblings, 3 replies; 50+ messages in thread
From: Agner Fog @ 2008-07-29  6:15 UTC (permalink / raw)
  To: Michael Matz
  Cc: Michael Meissner, Raksit Ashok, dclarke, gcc, TimothyPrince,
	Tarjei Knapstad

Michael Matz wrote:
> You must be doing something wrong.  If the compiler decides to inline the 
> string ops it either knows the size or you told it to do it anyway 
> (-minline-all-stringops or -minline-stringops-dynamically).  In both cases 
> will it use wider than byte moves when possible.
>   
g++ (v. 4.2.3) without any options converts memcpy with unknown size to  
rep movsb
g++ with option -fno-builtin calls memcpy in libc

The rep movs, stos, scas, cmps instructions are slower than function 
calls except in rare cases. The compiler should never use the string 
instructions. It is OK to use mov instructions if the size is known, but 
not string instructions.
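
The two cases being argued about (size known at compile time versus size
known only at run time) can be reproduced with a pair of functions like
the ones below. This is a minimal sketch; struct packet, copy_known and
copy_unknown are hypothetical names, and what the compiler actually emits
depends on its version, target and options.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct packet {
    unsigned char bytes[16];
};

/* Size is a compile-time constant: the compiler may expand this
   into a few plain mov instructions. */
void copy_known(struct packet *dst, const struct packet *src)
{
    assert(dst != NULL && src != NULL);
    memcpy(dst, src, sizeof *dst);
}

/* Size is only known at run time: depending on version and options
   the compiler may emit a call to memcpy or an inline string loop. */
void copy_unknown(void *dst, const void *src, size_t n)
{
    assert(n == 0 || (dst != NULL && src != NULL));
    memcpy(dst, src, n);
}
```

Compiling this with -O2 -S, once with and once without -fno-builtin, and
reading the generated assembly shows which expansion a given compiler
picks for each function.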

* Re: gcc will become the best optimizing x86 compiler
  2008-07-29  1:31                       ` Gerald Pfeifer
@ 2008-07-29  6:29                         ` Agner Fog
  2008-07-29  9:24                           ` Ben Elliston
  0 siblings, 1 reply; 50+ messages in thread
From: Agner Fog @ 2008-07-29  6:29 UTC (permalink / raw)
  To: Gerald Pfeifer; +Cc: dclarke, Paolo Carlini, gcc

Gerald Pfeifer wrote:
> See how user friendly we in GCC-land are in comparison? ;-)
>   
Since there is no libc mailing list, I thought that the gcc list is the 
place to contact the maintainers of libc. Am I on the wrong list? Or are 
there no maintainers of libc?

* Re: gcc will become the best optimizing x86 compiler
  2008-07-29  6:29                         ` Agner Fog
@ 2008-07-29  9:24                           ` Ben Elliston
  2008-07-31  8:12                             ` Christopher Faylor
  0 siblings, 1 reply; 50+ messages in thread
From: Ben Elliston @ 2008-07-29  9:24 UTC (permalink / raw)
  To: Agner Fog; +Cc: gcc

> Since there is no libc mailing list, I thought that the gcc list is the 
> place to contact the maintainers of libc. Am I on the wrong list? Or are 
> there no maintainers of libc?

See:
  http://sources.redhat.com/glibc/

You want the libc-alpha list, I think.

Cheers, Ben


* Re: gcc will become the best optimizing x86 compiler
  2008-07-29  6:15                 ` Agner Fog
@ 2008-07-29  9:31                   ` Richard Guenther
  2008-07-29  9:55                     ` Steven Bosscher
  2008-07-29 14:11                   ` Michael Matz
  2008-07-29 14:45                   ` Tim Prince
  2 siblings, 1 reply; 50+ messages in thread
From: Richard Guenther @ 2008-07-29  9:31 UTC (permalink / raw)
  To: Agner Fog
  Cc: Michael Matz, Michael Meissner, Raksit Ashok, dclarke, gcc,
	TimothyPrince, Tarjei Knapstad

On Tue, Jul 29, 2008 at 7:26 AM, Agner Fog <agner@agner.org> wrote:
> Michael Matz wrote:
>>
>> You must be doing something wrong.  If the compiler decides to inline the
>> string ops it either knows the size or you told it to do it anyway
>> (-minline-all-stringops or -minline-stringops-dynamically).  In both cases
>> will it use wider than byte moves when possible.
>>
>
> g++ (v. 4.2.3) without any options converts memcpy with unknown size to  rep
> movsb

Make sure to use -D__NO_STRING_INLINES to not get glibc's inline
implementation.

Richard.

> g++ with option -fno-builtin calls memcpy in libc
>
> The rep movs, stos, scas, cmps instructions are slower than function calls
> except in rare cases. The compiler should never use the string instructions.
> It is OK to use mov instructions if the size is known, but not string
> instructions.
>

* Re: gcc will become the best optimizing x86 compiler
  2008-07-29  9:31                   ` Richard Guenther
@ 2008-07-29  9:55                     ` Steven Bosscher
  2008-07-29 13:09                       ` Joseph S. Myers
  0 siblings, 1 reply; 50+ messages in thread
From: Steven Bosscher @ 2008-07-29  9:55 UTC (permalink / raw)
  To: Richard Guenther
  Cc: Agner Fog, Michael Matz, Michael Meissner, Raksit Ashok, dclarke,
	gcc, TimothyPrince, Tarjei Knapstad

On Tue, Jul 29, 2008 at 11:26 AM, Richard Guenther
<richard.guenther@gmail.com> wrote:
>> g++ (v. 4.2.3) without any options converts memcpy with unknown size to  rep
>> movsb
>
> Make sure to use -D__NO_STRING_INLINES to not get glibcs inline
> implementation.

Why is this not the default?

Gr.
Steven

* Re: gcc will become the best optimizing x86 compiler
  2008-07-29  9:55                     ` Steven Bosscher
@ 2008-07-29 13:09                       ` Joseph S. Myers
  0 siblings, 0 replies; 50+ messages in thread
From: Joseph S. Myers @ 2008-07-29 13:09 UTC (permalink / raw)
  To: Steven Bosscher
  Cc: Richard Guenther, Agner Fog, Michael Matz, Michael Meissner,
	Raksit Ashok, dclarke, gcc, TimothyPrince, Tarjei Knapstad

On Tue, 29 Jul 2008, Steven Bosscher wrote:

> On Tue, Jul 29, 2008 at 11:26 AM, Richard Guenther
> <richard.guenther@gmail.com> wrote:
> >> g++ (v. 4.2.3) without any options converts memcpy with unknown size to  rep
> >> movsb
> >
> > Make sure to use -D__NO_STRING_INLINES to not get glibcs inline
> > implementation.
> 
> Why is this not the default?

Because GNU projects are supposed to work together rather than forcibly 
overriding each other.  As GCC gets optimizations that obsolete particular 
parts of the optimizations in glibc's headers, Jakub updates the glibc 
headers to have only those optimizations not obsoleted by GCC (some call 
particular glibc-specific functions GCC doesn't know about, for example), 
depending on the GCC version.  If GCC were to override glibc 
unconditionally, for all the inline implementations, then the natural 
consequence would be for glibc to change __NO_STRING_INLINES to 
__REALLY_NO_STRING_INLINES, and so on - this macro is for the user to 
override, if particular inlines are not needed or not optimal for 
particular compiler versions or processors then the headers should be 
updated in glibc.  If you have issues with particular inlines (not limited 
to string functions), please file bugs in glibc Bugzilla, send patches to 
libc-alpha or contact Jakub.

Anyone finding memcpy converted inappropriately needs to give the full 
testcase - both original and preprocessed source - and full command-line 
options, so we can tell what inlines if any are being used.
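
A report of that kind might start from a file as small as the sketch
below. It is an assumed example, not a testcase anyone in this thread
posted: testcase.c and copy_bytes are hypothetical names, and defining
__NO_STRING_INLINES in the source is used here as the in-file equivalent
of passing -D__NO_STRING_INLINES on the command line.

```c
/* testcase.c (hypothetical file name)
 *
 * Assembly output:      gcc -O2 -S testcase.c
 * Preprocessed source:  gcc -O2 -E testcase.c -o testcase.i
 *
 * Defined before any libc header so that glibc's own inline string
 * routines, if the installed headers have any, stay out of the picture.
 * Remove the define to see the default behaviour.
 */
#ifndef __NO_STRING_INLINES
#define __NO_STRING_INLINES 1
#endif

#include <assert.h>
#include <stddef.h>
#include <string.h>

/* memcpy with a size the compiler cannot know at compile time. */
void *copy_bytes(void *dst, const void *src, size_t n)
{
    assert(n == 0 || (dst != NULL && src != NULL));
    return memcpy(dst, src, n);
}
```

Attaching testcase.c, testcase.i and the exact command line lets a reader
tell whether an inline expansion came from GCC's builtin or from a glibc
header.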

-- 
Joseph S. Myers
joseph@codesourcery.com

* Re: gcc will become the best optimizing x86 compiler
  2008-07-29  6:15                 ` Agner Fog
  2008-07-29  9:31                   ` Richard Guenther
@ 2008-07-29 14:11                   ` Michael Matz
  2008-07-29 14:45                   ` Tim Prince
  2 siblings, 0 replies; 50+ messages in thread
From: Michael Matz @ 2008-07-29 14:11 UTC (permalink / raw)
  To: Agner Fog
  Cc: Michael Meissner, Raksit Ashok, dclarke, gcc, TimothyPrince,
	Tarjei Knapstad

Hi,

On Tue, 29 Jul 2008, Agner Fog wrote:

> g++ (v. 4.2.3) without any options converts memcpy with unknown size to  rep
> movsb

Use newer GCCs.  They will (1) not expand memcpy inline for unknown sizes 
(without special options, also make sure you don't get the glibc inlines) 
and (2) won't expand to movsb.

> The rep movs, stos, scas, cmps instructions are slower than function 
> calls except in rare cases.

Depends on the microarchitecture.  For AMD Fam10 for instance REP prefixes 
are the preferred form for sizes between page-size and half of L1 size, 
when destination is aligned.
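
As a rough illustration of such a size- and alignment-dependent choice,
a selector could look like the sketch below. Everything in it is an
assumption made for the example: the names are hypothetical, and the page
size, L1 size and alignment are placeholder figures, not values taken
from any vendor manual.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum copy_strategy { COPY_MOV_LOOP, COPY_REP_MOVS };

/* Assumed figures for illustration only; real values are
   microarchitecture specific and would come from CPUID or tuning. */
static const size_t assumed_page_size = 4096;
static const size_t assumed_l1_size   = 64 * 1024;
static const size_t assumed_alignment = 16;

/* Pick REP MOVS only for an aligned destination and a size between
   one page and half of L1; otherwise fall back to a mov loop. */
enum copy_strategy choose_strategy(const void *dst, size_t n)
{
    assert(dst != NULL);
    int aligned = ((uintptr_t)dst % assumed_alignment) == 0;
    int in_window = n >= assumed_page_size && n <= assumed_l1_size / 2;
    return (aligned && in_window) ? COPY_REP_MOVS : COPY_MOV_LOOP;
}
```

The point of the sketch is only that the best choice is a function of
size, alignment and CPU, which is why a single fixed rule tends to be
wrong somewhere.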

> The compiler should never use the string instructions. It is OK to use 
> mov instructions if the size is known, but not string instructions.

General statements are generally wrong :)

Ciao,
Michael.

* Re: gcc will become the best optimizing x86 compiler
  2008-07-29  6:15                 ` Agner Fog
  2008-07-29  9:31                   ` Richard Guenther
  2008-07-29 14:11                   ` Michael Matz
@ 2008-07-29 14:45                   ` Tim Prince
  2 siblings, 0 replies; 50+ messages in thread
From: Tim Prince @ 2008-07-29 14:45 UTC (permalink / raw)
  To: Agner Fog
  Cc: Michael Matz, Michael Meissner, Raksit Ashok, dclarke, gcc,
	TimothyPrince, Tarjei Knapstad

Agner Fog wrote:
> Michael Matz wrote:
>> You must be doing something wrong.  If the compiler decides to inline 
>> the string ops it either knows the size or you told it to do it anyway 
>> (-minline-all-stringops or -minline-stringops-dynamically).  In both 
>> cases will it use wider than byte moves when possible.
>>   
> g++ (v. 4.2.3) without any options converts memcpy with unknown size to  
> rep movsb
> g++ with option -fno-builtin calls memcpy in libc
> 
> The rep movs, stos, scas, cmps instructions are slower than function 
> calls except in rare cases. The compiler should never use the string 
> instructions. It is OK to use mov instructions if the size is known, but 
> not string instructions.
I assume Agner is talking about the i386 target defaults, while Michael 
was talking about other target defaults.  People who code for i386 must 
often use memcpy for short or medium length unaligned strings, and should 
be aware of the issues of long memcpy strings.  Even for i386, the 
compiler should recognize where, for example, a single int move is 
explicitly the right thing.  The rep string issue has been prominent 
enough for newer CPUs to be designed to recover some of the performance.

* Re: gcc will become the best optimizing x86 compiler
  2008-07-25  7:23         ` Agner Fog
  2008-07-26  0:23           ` Michael Meissner
@ 2008-07-30 16:37           ` Denys Vlasenko
  2008-07-30 16:40             ` Denys Vlasenko
  1 sibling, 1 reply; 50+ messages in thread
From: Denys Vlasenko @ 2008-07-30 16:37 UTC (permalink / raw)
  To: Agner Fog; +Cc: Raksit Ashok, dclarke, gcc, TimothyPrince

On Fri, Jul 25, 2008 at 9:08 AM, Agner Fog <agner@agner.org> wrote:
> Raksit Ashok wrote:
>>There is a more optimized version for 64-bit:
>>http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/lib/libc/amd64/gen/memcpy.s
>>I think this looks similar to your implementation, Agner.
>
> Yes it is similar to my code.

A 3164-line source file which implements memcpy().
You've got to be kidding.
How much of the L1 icache does it blow away in the process?
I bet it performs wonderfully on microbenchmarks though.

                .balign 16               # sadistic alignment strikes again
L(bkPxQx):      .int L(bkP0Q0)-L(bkPxQx) # why use two bytes when we can use four?

Seriously. What possible reason can there be to align
a randomly accessed data table to 16 bytes?
4 bytes I understand, but 16?
--
vda

* Re: gcc will become the best optimizing x86 compiler
  2008-07-30 16:37           ` Denys Vlasenko
@ 2008-07-30 16:40             ` Denys Vlasenko
  2008-07-30 17:52               ` Agner Fog
  0 siblings, 1 reply; 50+ messages in thread
From: Denys Vlasenko @ 2008-07-30 16:40 UTC (permalink / raw)
  To: Agner Fog; +Cc: Raksit Ashok, dclarke, gcc, TimothyPrince

On Wed, Jul 30, 2008 at 5:57 PM, Denys Vlasenko
<vda.linux@googlemail.com> wrote:
> On Fri, Jul 25, 2008 at 9:08 AM, Agner Fog <agner@agner.org> wrote:
>> Raksit Ashok wrote:
>>>There is a more optimized version for 64-bit:
>>>http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/lib/libc/amd64/gen/memcpy.s
>>>I think this looks similar to your implementation, Agner.
>>
>> Yes it is similar to my code.
>
> 3164 line source file which implements memcpy().
> You got to be kidding.
> How much of L1 icache it blows away in the process?
> I bet it performs wonderfully on microbenchmarks though.
>
>                 .balign 16               # sadistic alignment strikes again
> L(bkPxQx):      .int L(bkP0Q0)-L(bkPxQx) # why use two bytes when we can use four?
>
> Seriously. What possible reason can there be to align
> a randomly accessed data table to 16 bytes?
> 4 bytes I understand, but 16?

I'm afraid I sounded a bit confrontational above, so here comes the
clarification. I have nothing against making code faster.
But there should be some balance between the -O999 mindset
and the -Os mindset. If you just found a tweak which gives you a 1.2%
speedup in a microbenchmark but the code grew 4 times bigger, *stop*.
Think about it.

"We unrolled the loop two gazillion times and it's 3% faster now"
is a similarly bad idea.

I must admit that I didn't look too closely at
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/lib/libc/amd64/gen/memcpy.s
but at first glance it sure looks like someone
got carried away a bit.
--
vda

* Re: gcc will become the best optimizing x86 compiler
  2008-07-30 16:40             ` Denys Vlasenko
@ 2008-07-30 17:52               ` Agner Fog
  2008-07-30 22:42                 ` Dennis Clarke
  2008-07-31  2:57                 ` Denys Vlasenko
  0 siblings, 2 replies; 50+ messages in thread
From: Agner Fog @ 2008-07-30 17:52 UTC (permalink / raw)
  To: Denys Vlasenko; +Cc: Raksit Ashok, dclarke, gcc, TimothyPrince

Denys Vlasenko wrote:
>> 3164 line source file which implements memcpy().
>> You got to be kidding.
>> How much of L1 icache it blows away in the process?
>> I bet it performs wonderfully on microbenchmarks though.
>>     
I agree that the OpenSolaris memcpy is bigger than necessary. However, 
it is necessary to have 16 branches for covering all possible alignments 
modulo 16. This is because, unfortunately, there is no XMM shift 
instruction with a variable count, only with a constant count, so we 
need one branch for each value of the shift count. Since only one of the 
branches is used, it doesn't take much space in the code cache. The 
speed is improved by a factor of 4-5 by this 16-branch algorithm, so it is 
certainly worth the extra complexity.
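
The constraint described above can be sketched in C as follows. This is
only an illustration, not the OpenSolaris code: the name extract16 is
hypothetical, the SSE2 intrinsics from <emmintrin.h> are assumed to be
available (a plain byte-copy fallback is used otherwise), and unaligned
loads are used so the sketch works on any byte array, whereas a real
memcpy would load aligned blocks. The byte-shift intrinsics only accept
a compile-time constant count, so each of the 16 possible offsets gets
its own case.

```c
#include <assert.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper: lo and hi are two adjacent 16-byte blocks.
 * Writes to out the 16 bytes that start `shift` bytes into lo. */
void extract16(const unsigned char lo[16], const unsigned char hi[16],
               unsigned shift, unsigned char out[16])
{
    assert(shift < 16);
#if defined(__SSE2__)
    __m128i a = _mm_loadu_si128((const __m128i *)lo);
    __m128i b = _mm_loadu_si128((const __m128i *)hi);
    __m128i r;
    /* The byte-shift count must be a compile-time constant,
     * hence one case per possible offset. */
#define SHIFT_CASE(k)                                    \
    case k:                                              \
        r = _mm_or_si128(_mm_srli_si128(a, k),           \
                         _mm_slli_si128(b, 16 - k));     \
        break;
    switch (shift) {
        SHIFT_CASE(0)  SHIFT_CASE(1)  SHIFT_CASE(2)  SHIFT_CASE(3)
        SHIFT_CASE(4)  SHIFT_CASE(5)  SHIFT_CASE(6)  SHIFT_CASE(7)
        SHIFT_CASE(8)  SHIFT_CASE(9)  SHIFT_CASE(10) SHIFT_CASE(11)
        SHIFT_CASE(12) SHIFT_CASE(13) SHIFT_CASE(14) SHIFT_CASE(15)
    default:
        r = a; /* unreachable because of the assert above */
        break;
    }
#undef SHIFT_CASE
    _mm_storeu_si128((__m128i *)out, r);
#else
    /* Portable fallback without SSE2: same result via plain copies. */
    unsigned char tmp[32];
    memcpy(tmp, lo, 16);
    memcpy(tmp + 16, hi, 16);
    memcpy(out, tmp + shift, 16);
#endif
}
```

Only the case matching the actual offset runs for a given copy, which is
the sense in which a single branch is used.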

The future AMD SSE5 instruction set offers a possibility to join the 
many branches into one, but only on AMD processors. Intel is not going 
to support SSE5, and the future Intel AVX instruction set doesn't have 
an instruction that can be used for this purpose. So we will need 
separate branches for Intel and AMD code in future implementations of 
libc. (Explained in www.agner.org/optimize/asmexamples.zip).

> "We unrolled the loop two gazillion times and it's 3% faster now"
> is a similarly bad idea.
>   
I agree completely. My memcpy code is much smaller than the OpenSolaris 
and Mac implementations and approximately equally fast. Some compilers 
unroll loops way too much in my opinion.

* Re: gcc will become the best optimizing x86 compiler
  2008-07-30 17:52               ` Agner Fog
@ 2008-07-30 22:42                 ` Dennis Clarke
  2008-07-31  2:57                 ` Denys Vlasenko
  1 sibling, 0 replies; 50+ messages in thread
From: Dennis Clarke @ 2008-07-30 22:42 UTC (permalink / raw)
  To: Agner Fog; +Cc: Denys Vlasenko, Raksit Ashok, gcc, TimothyPrince

On Wed, Jul 30, 2008 at 5:14 PM, Agner Fog <agner@agner.org> wrote:
> Denys Vlasenko wrote:
>>>
>>> 3164 line source file which implements memcpy().
>>> You got to be kidding.
>>> How much of L1 icache it blows away in the process?
>>> I bet it performs wonderfully on microbenchmarks though.
>>>
>
> I agree that the OpenSolaris memcpy is bigger than necessary. However, it is
> necessary to have 16 branches for covering all possible alignments modulo
> 16. This is because, unfortunately, there is no XMM shift instruction with a
> variable count, only with a constant count, so we need one branch for each
> value of the shift count. Since only one of the branches is used, it doesn't
> take much space in the code cache. The speed is improved by a factor 4-5 by
> this 16-branch algorithm, so it is certainly worth the extra complexity.

You forgot to look at PowerPC :

http://cvs.opensolaris.org/source/xref/ppc-dev/ppc-dev/usr/src/lib/libc/ppc/gen/memcpy.s

is that nice and small ?


Dennis Clarke

* Re: gcc will become the best optimizing x86 compiler
  2008-07-30 17:52               ` Agner Fog
  2008-07-30 22:42                 ` Dennis Clarke
@ 2008-07-31  2:57                 ` Denys Vlasenko
  2008-07-31  8:18                   ` Agner Fog
  1 sibling, 1 reply; 50+ messages in thread
From: Denys Vlasenko @ 2008-07-31  2:57 UTC (permalink / raw)
  To: Agner Fog; +Cc: Raksit Ashok, dclarke, gcc, TimothyPrince

On Wednesday 30 July 2008 19:14, Agner Fog wrote:
> I agree that the OpenSolaris memcpy is bigger than necessary. However, 
> it is necessary to have 16 branches for covering all possible alignments 
> modulo 16. This is because, unfortunately, there is no XMM shift 
> instruction with a variable count, only with a constant count, so we 
> need one branch for each value of the shift count. Since only one of the 
> branches is used, it doesn't take much space in the code cache. The 
> speed is improved by a factor 4-5 by this 16-branch algorithm, so it is 
> certainly worth the extra complexity.

I tend to doubt that odd-byte aligned large memcpys are anywhere
near typical. malloc and mmap both return well-aligned buffers
(say, 8 byte aligned). Static and on-stack objects are also
at least word-aligned 99% of the time.

memcpy can just use "relatively simple" code for copies in which
either src or dst is not word aligned. This cuts possibilities down
from 16 to 4 (or even 2?).
--
vda

* Re: gcc will become the best optimizing x86 compiler
  2008-07-29  9:24                           ` Ben Elliston
@ 2008-07-31  8:12                             ` Christopher Faylor
  0 siblings, 0 replies; 50+ messages in thread
From: Christopher Faylor @ 2008-07-31  8:12 UTC (permalink / raw)
  To: gcc, Agner Fog, Ben Elliston

On Tue, Jul 29, 2008 at 04:14:49PM +1000, Ben Elliston wrote:
>> Since there is no libc mailing list, I thought that the gcc list is the 
>> place to contact the maintainers of libc. Am I on the wrong list? Or are 
>> there no maintainers of libc?
>
>See:
>  http://sources.redhat.com/glibc/
>
>You want the libc-alpha list, I think.

I think libc-help is a more likely place to start.

cgf

* Re: gcc will become the best optimizing x86 compiler
  2008-07-31  2:57                 ` Denys Vlasenko
@ 2008-07-31  8:18                   ` Agner Fog
  2008-07-31 11:00                     ` Dave Korn
  0 siblings, 1 reply; 50+ messages in thread
From: Agner Fog @ 2008-07-31  8:18 UTC (permalink / raw)
  To: Denys Vlasenko; +Cc: Raksit Ashok, dclarke, gcc, TimothyPrince

Denys Vlasenko wrote:
> I tend to doubt that odd-byte aligned large memcpys are anywhere
> near typical. malloc and mmap both return well-aligned buffers
> (say, 8 byte aligned). Static and on-stack objects are also
> at least word-aligned 99% of the time.
>
> memcpy can just use "relatively simple" code for copies in which
> either src or dst is not word aligned. This cuts possibilities down
> from 16 to 4 (or even 2?).
>   
The XMM code is still more than 3 times faster than rep movsl when data 
are aligned by 4 or 8, but not by 16.
Even if odd addresses are rare, they must be supported, but we can put 
the most common cases first.
strcpy and strcat can be implemented efficiently simply by calling 
strlen and memcpy, since both strlen and memcpy can be optimized very 
well. This can give unaligned addresses.
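
A minimal sketch of that idea in C, for illustration only. The names
sketch_strcpy and sketch_strcat are hypothetical and this is not the
code of any particular libc.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch: strcpy expressed as strlen followed by memcpy. */
char *sketch_strcpy(char *dst, const char *src)
{
    assert(dst != NULL && src != NULL);
    size_t n = strlen(src);
    memcpy(dst, src, n + 1);   /* n + 1 also copies the terminating NUL */
    return dst;
}

/* Hypothetical sketch: strcat appends at dst + strlen(dst), an address
 * whose alignment depends on the current string length. */
char *sketch_strcat(char *dst, const char *src)
{
    assert(dst != NULL && src != NULL);
    size_t dlen = strlen(dst);
    sketch_strcpy(dst + dlen, src);
    return dst;
}
```

Appending to a 5-character string, for example, makes memcpy write at
an offset of 5 bytes into the destination buffer, whatever the
alignment of the buffer itself.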

Dennis Clarke wrote:
> You forgot to look at PowerPC :
>
> http://cvs.opensolaris.org/source/xref/ppc-dev/ppc-dev/usr/src/lib/libc/ppc/gen/memcpy.s
>
> is that nice and small ?
>   
.. and slow. Why doesn't it use Altivec?

* RE: gcc will become the best optimizing x86 compiler
  2008-07-31  8:18                   ` Agner Fog
@ 2008-07-31 11:00                     ` Dave Korn
  0 siblings, 0 replies; 50+ messages in thread
From: Dave Korn @ 2008-07-31 11:00 UTC (permalink / raw)
  To: 'Agner Fog', 'Denys Vlasenko'
  Cc: 'Raksit Ashok', dclarke, gcc, TimothyPrince

Agner Fog wrote on 31 July 2008 07:14:

> Denys Vlasenko wrote:
>> I tend to doubt that odd-byte aligned large memcpys are anywhere
>> near typical. malloc and mmap both return well-aligned buffers
>> (say, 8 byte aligned). Static and on-stack objects are also
>> at least word-aligned 99% of the time.
>> 
>> memcpy can just use "relatively simple" code for copies in which
>> either src or dst is not word aligned. This cuts possibilities down from
>> 16 to 4 (or even 2?). 
>> 
> The XMM code is still more than 3 times faster than rep movsl when data
> are aligned by 4 or 8, but not by 16.
> Even if odd addresses are rare, they must be supported, but we can put
> the most common cases first.


  In the real world, unaligned memcpys are anything but rare.  Everything's
networked these days, remember?  Stuff gets misaligned real quick when you
start adding and removing various network layer headers and trailers to
unpredictably-sized packets.
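
  A small C sketch of the effect, for illustration only.  The helper
names misalignment and skip_header are hypothetical, and the 14-byte
header length (an Ethernet-style header) is an assumption; no particular
protocol is named above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: how many bytes p lies past the previous
 * multiple of align. */
size_t misalignment(const void *p, size_t align)
{
    assert(align != 0);
    return (size_t)((uintptr_t)p % align);
}

/* Hypothetical helper: step over a header of header_len bytes. */
const unsigned char *skip_header(const unsigned char *frame,
                                 size_t header_len)
{
    return frame + header_len;
}
```

  With a frame buffer that starts on a 16-byte boundary, skipping a
14-byte header leaves the payload 14 bytes past that boundary, which is
also 2 bytes past a 4-byte boundary, so a memcpy of the payload starts
from an unaligned source.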


    cheers,
      DaveK
-- 
Can't think of a witty .sigline today....

* Re: gcc will become the best optimizing x86 compiler
  2008-07-30 15:34 Eus
@ 2008-07-30 16:09 ` Dennis Clarke
  0 siblings, 0 replies; 50+ messages in thread
From: Dennis Clarke @ 2008-07-30 16:09 UTC (permalink / raw)
  To: eus; +Cc: GCC Development Mailing List

On Wed, Jul 30, 2008 at 3:23 PM, Eus <eus@member.fsf.org> wrote:
> Hi Ho!
>
> --- On Tue, 7/29/08, "Dennis Clarke" <blastwave@gmail.com> wrote:
>
>> hold on .. on the NEWS page I see ... okay .. how very user friendly.
>> Sort of the thing one would put on the project homepage I would think.
>
> Would you mind telling me what you saw?
> I looked for the interesting part in the latest release of the NEWS file in CVS, but to no avail.
>
> Thank you for your help.

It says :

GNU C Library NEWS -- history of user-visible changes.  2008-7-27
Copyright (C) 1992-2007, 2008 Free Software Foundation, Inc.
See the end for copying conditions.

Please send GNU C library bug reports via <http://sources.redhat.com/bugzilla/>
using `glibc' in the "product" field.

Version 2.9

* Unified lookup for getaddrinfo: IPv4 and IPv6 addresses are now looked
  up at the same time.  Implemented by Ulrich Drepper.

* TLS descriptors for LD and GD on x86 and x86-64.
  Implemented by Alexandre Oliva.

* getaddrinfo now handles DCCP and UDPlite.
  Implemented by Ulrich Drepper.

* New fixed-size conversion macros: htobe16, htole16, be16toh, le16toh,
  htobe32, htole32, be32toh, le32toh, htobe64, htole64, be64toh, le64toh.
  Implemented by Ulrich Drepper.

* New implementation of memmem, strstr, and strcasestr which is O(n).
  Implemented by Eric Blake.

* New Linux interfaces: inotify_init1, paccept, dup3, epoll_create2, pipe2

* Implement "e" option for popen to open file descriptor with the
  close-on-exec flag set

* Many functions, exported and internal, now atomically set the close-on-exec
  flag when run on a sufficiently new kernel.  Implemented by Ulrich Drepper.

Version 2.8

* New locales: bo_CN, bo_IN, shs_CA.

* New encoding: HP-ROMAN9, HP-GREEK8, HP-THAI8, HP-TURKISH8.

* Sorting rules for some Indian languages (Devanagari and Gujarati).
  Implemented by Pravin Satpute.

* IPV6 addresses in /etc/resolv.conf can now have a scope ID

* nscd caches now all timeouts for DNS entries
  Implemented by Ulrich Drepper.

* nscd is more efficient and wakes up less often.
  Implemented by Ulrich Drepper.

* More checking functions: asprintf, dprintf, obstack_printf, vasprintf,
  vdprintf, and obstack_vprintf.
  Implemented by Jakub Jelinek.

* Faster memset for x86-64.
  Implemented by Harsha Jagasia and H.J. Lu.

* Faster memcpy on x86.
  Implemented by Ulrich Drepper.

* ARG_MAX is not anymore constant on Linux.  Use sysconf(_SC_ARG_MAX).
  Implemented by Ulrich Drepper.

* Faster sqrt and sqrtf implementation for some PPC variants.
  Implemented by Stephen Munroe.

Version 2.7

* More checking functions: fread, fread_unlocked, open*, mq_open.
  Implemented by Jakub Jelinek and Ulrich Drepper.

* Extend fortification to C++.  Implemented by Jakub Jelinek.

* Implement 'm' modifier for scanf.  Add stricter C99/SUS compliance
  by not recognizing 'a' as a modifier when those specs are requested.
  Implemented by Jakub Jelinek.

* PPC optimizations to math and string functions.
  Implemented by Steven Munroe.

* New interfaces: mkostemp, mkostemp64.  Like mkstemp* but allow additional
  options to be passed.  Implemented by Ulrich Drepper.

* More CPU set manipulation functions.  Implemented by Ulrich Drepper.

* New Linux interfaces: signalfd, eventfd, eventfd_read, and eventfd_write.
  Implemented by Ulrich Drepper.

* Handle private futexes in the NPTL implementation.
  Implemented by Jakub Jelinek and Ulrich Drepper.

* Add support for O_CLOEXEC.  Implement in Hurd.  Use throughout libc.
  Implemented by Roland McGrath and Ulrich Drepper.

* Linux/x86-64 vDSO support.  Implemented by Ulrich Drepper.

* SHA-256 and SHA-512 based password encryption.
  Implemented by Ulrich Drepper.

* New locales: ber_DZ, ber_MA, en_NG, fil_PH, fur_IT, fy_DE, ha_NG, ig_NG,
  ik_CA, iu_CA, li_BE, li_NL, nds_DE, nds_NL, pap_AN, sc_IT, tk_TM, ug_CN,
  yo_NG.

+ New iconv modules: MAC-CENTRALEUROPE, ISO-8859-9E, KOI8-RU.
  Implemented by Ulrich Drepper.

Version 2.6

* New Linux interfaces: epoll_pwait, sched_getcpu.

* New generic interfaces: strerror_l.

* nscd can now cache the services database.   Implemented by Ulrich Drepper.

etc etc etc

Dennis

* Re: gcc will become the best optimizing x86 compiler
@ 2008-07-30 15:34 Eus
  2008-07-30 16:09 ` Dennis Clarke
  0 siblings, 1 reply; 50+ messages in thread
From: Eus @ 2008-07-30 15:34 UTC (permalink / raw)
  To: Dennis Clarke; +Cc: GCC Development Mailing List

Hi Ho!

--- On Tue, 7/29/08, "Dennis Clarke" <blastwave@gmail.com> wrote:

> hold on .. on the NEWS page I see ... okay .. how very user friendly.
> Sort of the thing one would put on the project homepage I would think.

Would you mind telling me what you saw?
I looked for the interesting part in the latest release of the NEWS file in CVS, but to no avail.

Thank you for your help.

> Dennis

Best regards,
Eus


      

Thread overview: 50+ messages
2008-07-23 14:59 Is cross-section inlining valid behaviour? Bingfeng Mei
2008-07-23 15:07 ` Richard Guenther
2008-07-23 15:14 ` Dave Korn
2008-07-23 15:31   ` Bingfeng Mei
2008-07-23 17:25 ` gcc will become the best optimizing x86 compiler Agner Fog
2008-07-23 17:33   ` Tim Prince
2008-07-24  8:04   ` Dennis Clarke
2008-07-24  9:41     ` Agner Fog
2008-07-24 10:10       ` Dave Korn
2008-07-24 13:20         ` Basile STARYNKEVITCH
2008-07-24 13:31           ` Dave Korn
2008-07-24 13:59           ` Agner Fog
2008-07-24 14:40             ` Richard Guenther
2008-07-28 10:57             ` Andrew Haley
2008-07-24 15:02           ` Joseph S. Myers
2008-07-24 16:26             ` Agner Fog
2008-07-24 17:17             ` Basile STARYNKEVITCH
2008-07-24 17:21       ` Raksit Ashok
2008-07-25  7:23         ` Agner Fog
2008-07-26  0:23           ` Michael Meissner
2008-07-26 17:49             ` Agner Fog
2008-07-28 11:45             ` Agner Fog
2008-07-28 14:40               ` Daniel Jacobowitz
2008-07-28 17:37                 ` Dennis Clarke
2008-07-28 17:54                   ` Paolo Carlini
2008-07-28 18:31                     ` Dennis Clarke
2008-07-28 18:37                       ` Ian Lance Taylor
2008-07-28 19:44                       ` Dave Korn
2008-07-28 21:40                         ` Dennis Clarke
2008-07-29  1:31                       ` Gerald Pfeifer
2008-07-29  6:29                         ` Agner Fog
2008-07-29  9:24                           ` Ben Elliston
2008-07-31  8:12                             ` Christopher Faylor
2008-07-28 17:19               ` Michael Matz
2008-07-29  6:15                 ` Agner Fog
2008-07-29  9:31                   ` Richard Guenther
2008-07-29  9:55                     ` Steven Bosscher
2008-07-29 13:09                       ` Joseph S. Myers
2008-07-29 14:11                   ` Michael Matz
2008-07-29 14:45                   ` Tim Prince
2008-07-30 16:37           ` Denys Vlasenko
2008-07-30 16:40             ` Denys Vlasenko
2008-07-30 17:52               ` Agner Fog
2008-07-30 22:42                 ` Dennis Clarke
2008-07-31  2:57                 ` Denys Vlasenko
2008-07-31  8:18                   ` Agner Fog
2008-07-31 11:00                     ` Dave Korn
2008-07-24 10:09   ` Zoltán Kócsi
2008-07-30 15:34 Eus
2008-07-30 16:09 ` Dennis Clarke
