* Old compiler optimizations in installed headers
From: Joseph Myers @ 2015-05-22 17:06 UTC (permalink / raw)
To: libc-alpha
We've recently had a discussion of whether it makes sense to keep various
macros and inline functions in bits/string2.h that are only used for old
GCC versions and only serve to optimize cases of small / constant
arguments with those versions. A similar issue applies to such things as
inline __signbit* definitions in bits/mathinline.h - given
<https://sourceware.org/ml/libc-alpha/2015-05/msg00521.html> could we just
remove the various inlines on the basis that optimization for compilers
before GCC 4.0 isn't a concern - and probably to various other inlines.
There was previously a discussion with a proposal in
<https://sourceware.org/ml/libc-alpha/2013-01/msg00157.html> and with
<https://sourceware.org/ml/libc-alpha/2013-01/msg00270.html> saying to
avoid optimization regressions for old compilers.
Regarding what's supported at all, the baseline is C90 or C++98 plus long
long support (although there are various places where headers in fact
depend on other extensions to provide particular functionality).
Meanwhile, lots of architecture-specific .S function implementations in
NPTL (especially, but maybe also elsewhere) have been removed, with
benchmark evidence showing they don't actually provide better performance
but do cause significant trouble in maintenance.
I'd like to propose that:
(a) if an optimization could clearly be done in the compiler - if it only
depends on the standard semantics of standard functions, or fully-defined
semantics of glibc functions that it would be reasonable to encode into
GCC, rather than e.g. generating calls to a glibc-internal function - then
we should be wary of adding it to glibc's headers in the first place: in
accordance with principles of GNU projects working together, it's better
to add the optimization to GCC; and
(b) where such optimizations are present in glibc headers and only
relevant for GCC versions before some baseline (maybe 4.1 or 4.3), we
should be willing to remove them to simplify the code and remove variants
that aren't covered by normal glibc testing at all, without being
concerned about worse code possibly being generated for users of old
compilers. Applying that principle to signbit and is* macros would allow
complete removal of some mathinline.h implementations and removal of
significant amounts of code from others.
--
Joseph S. Myers
joseph@codesourcery.com
* Re: Old compiler optimizations in installed headers
From: Paul Eggert @ 2015-05-22 21:36 UTC (permalink / raw)
To: Joseph Myers, libc-alpha
On 05/22/2015 08:03 AM, Joseph Myers wrote:
> (b) where such optimizations are present in glibc headers and only
> relevant for GCC versions before some baseline (maybe 4.1 or 4.3), we
> should be willing to remove them to simplify the code
This all sounds reasonable. How much of a maintenance difference would
it be to select 4.1 vs 4.3 for the baseline? If that's significant, the
last GCC 4.3.x release was in 2009 which is quite a while ago if we're
talking about development environments, so I suggest going with 4.3.
* Re: Old compiler optimizations in installed headers
From: Joseph Myers @ 2015-05-22 22:10 UTC (permalink / raw)
To: Paul Eggert; +Cc: libc-alpha
On Fri, 22 May 2015, Paul Eggert wrote:
> On 05/22/2015 08:03 AM, Joseph Myers wrote:
> > (b) where such optimizations are present in glibc headers and only
> > relevant for GCC versions before some baseline (maybe 4.1 or 4.3), we
> > should be willing to remove them to simplify the code
>
> This all sounds reasonable. How much of a maintenance difference would it be
> to select 4.1 vs 4.3 for the baseline? If that's significant, the last GCC
> 4.3.x release was in 2009 which is quite a while ago if we're talking about
> development environments, so I suggest going with 4.3.
There are some 4.3 conditionals in byteswap.h headers. Other than that it
doesn't look like there would be much advantage in 4.3 over 4.0 (3.4 would
be sufficient for bits/string2.h, 4.0 for bits/mathinline.h on various
architectures).
--
Joseph S. Myers
joseph@codesourcery.com
* Re: Old compiler optimizations in installed headers
From: Roland McGrath @ 2015-05-22 22:19 UTC (permalink / raw)
To: Joseph Myers; +Cc: libc-alpha
I think that all makes perfect sense. Starting with 4.1 or 4.3 is a
completely reasonable conservative starting place.
I think we should consider the minimum GCC version required for
building libc itself as the only clear upper bound on the minimum
GCC version for which we bother to maintain any optimizations in
headers. Keeping the minimum for header optimizations at a lower
version requires continual proof that it matters to anyone, any time
it looks like raising it would ease the maintenance burden.
Thanks,
Roland
* Re: Old compiler optimizations in installed headers
From: Ondřej Bílka @ 2015-05-24 17:10 UTC (permalink / raw)
To: Joseph Myers; +Cc: libc-alpha
On Fri, May 22, 2015 at 03:03:01PM +0000, Joseph Myers wrote:
> We've recently had a discussion of whether it makes sense to keep various
> macros and inline functions in bits/string2.h that are only used for old
> GCC versions and only serve to optimize cases of small / constant
> arguments with those versions. A similar issue applies to such things as
> inline __signbit* definitions in bits/mathinline.h - given
> <https://sourceware.org/ml/libc-alpha/2015-05/msg00521.html> could we just
> remove the various inlines on the basis that optimization for compilers
> before GCC 4.0 isn't a concern - and probably to various other inlines.
>
While this is a good idea in general, I would be wary of the signbit et al.
macros. Using GCC instead will cause performance regressions in some cases,
as it generates suboptimal branchless code. Adding branches is correct
here: the savings from branchless code come from avoiding branch
mispredictions, which don't happen in this case. You would need more than
5% of inputs to be NaN to offset the single-cycle penalty of going
branchless, and at that point your problem is that you are producing
garbage instead of performance.
* Re: Old compiler optimizations in installed headers
From: Ondřej Bílka @ 2015-05-25 2:37 UTC (permalink / raw)
To: Joseph Myers; +Cc: libc-alpha
On Fri, May 22, 2015 at 03:03:01PM +0000, Joseph Myers wrote:
> We've recently had a discussion of whether it makes sense to keep various
> macros and inline functions in bits/string2.h that are only used for old
> GCC versions and only serve to optimize cases of small / constant
> arguments with those versions. A similar issue applies to such things as
> inline __signbit* definitions in bits/mathinline.h - given
> <https://sourceware.org/ml/libc-alpha/2015-05/msg00521.html> could we just
> remove the various inlines on the basis that optimization for compilers
> before GCC 4.0 isn't a concern - and probably to various other inlines.
>
> There was previously a discussion with a proposal in
> <https://sourceware.org/ml/libc-alpha/2013-01/msg00157.html> and with
> <https://sourceware.org/ml/libc-alpha/2013-01/msg00270.html> saying to
> avoid optimization regressions for old compilers.
>
> Regarding what's supported at all, the baseline is C90 or C++98 plus long
> long support (although there are various places where headers in fact
> depend on other extensions to provide particular functionality).
>
> Meanwhile, lots of architecture-specific .S function implementations in
> NPTL (especially, but maybe also elsewhere) have been removed, with
> benchmark evidence showing they don't actually provide better performance
> but do cause significant trouble in maintenance.
>
> I'd like to propose that:
>
> (a) if an optimization could clearly be done in the compiler - if it only
> depends on the standard semantics of standard functions, or fully-defined
> semantics of glibc functions that it would be reasonable to encode into
> GCC, rather than e.g. generating calls to a glibc-internal function - then
> we should be wary of adding it to glibc's headers in the first place: in
> accordance with principles of GNU projects working together, it's better
> to add the optimization to GCC; and
>
I disagree, for the simple reason of cost. It's considerably easier to
write an inline function that does a transformation than to write the
equivalent GCC pass.
The result is that a lot of builtins cause performance regressions. I
would need to review them separately, as they don't inspire a lot of
confidence.
I just found that GCC also seriously messes up strcmp with constant
strings. It uses rep cmpsb, which is around three times slower than the
libcall in the following simple test. Compile a.c and b.c separately, then
link each in turn with main.c.
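a.c:
#include <string.h>

/* Constant second argument: GCC may expand this strcmp inline.  */
int strcmp2(char *c)
{
    return strcmp(c, "ahtusntueoahsntsnthusnthueoasnth");
}

b.c:
#include <string.h>

/* Second argument unknown at compile time: stays a library call.  */
extern char *str;
int strcmp2(char *c)
{
    return strcmp(c, str);
}

main.c:
/* Defined in a.c or b.c, whichever is linked in.  */
int strcmp2(char *c);

char *str = "ahtusntueoahsntsnthusnthueoasnth";
char *str2 = "bhtusntueoahsntsnthusnthueoasnti";

int main()
{
    int i;
    for (i = 0; i < 10000000; i++)
        strcmp2(str2);
    return 0;
}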
* Re: Old compiler optimizations in installed headers
From: Richard Henderson @ 2015-05-27 20:04 UTC (permalink / raw)
To: Ondřej Bílka, Joseph Myers; +Cc: libc-alpha
On 05/24/2015 06:31 AM, Ondřej Bílka wrote:
> While this is a good idea in general, I would be wary of the signbit et al.
> macros. Using GCC instead will cause performance regressions in some cases,
> as it generates suboptimal branchless code. Adding branches is correct
> here: the savings from branchless code come from avoiding branch
> mispredictions, which don't happen in this case. You would need more than
> 5% of inputs to be NaN to offset the single-cycle penalty of going
> branchless, and at that point your problem is that you are producing
> garbage instead of performance.
I beg your pardon? Why would you *ever* have a branch implementing signbit?
r~
* Re: Old compiler optimizations in installed headers
From: Ondřej Bílka @ 2015-05-27 20:13 UTC (permalink / raw)
To: Richard Henderson; +Cc: Joseph Myers, libc-alpha
On Wed, May 27, 2015 at 08:15:51AM -0700, Richard Henderson wrote:
> On 05/24/2015 06:31 AM, Ondřej Bílka wrote:
> > While this is a good idea in general, I would be wary of the signbit et al.
> > macros. Using GCC instead will cause performance regressions in some cases,
> > as it generates suboptimal branchless code. Adding branches is correct
> > here: the savings from branchless code come from avoiding branch
> > mispredictions, which don't happen in this case. You would need more than
> > 5% of inputs to be NaN to offset the single-cycle penalty of going
> > branchless, and at that point your problem is that you are producing
> > garbage instead of performance.
>
> I beg your pardon? Why would you *ever* have a branch implementing signbit?
>
Sorry, I meant that you must be careful with the others, like isinf. Not
signbit; that's OK.
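
To illustrate the distinction (this is not code from glibc or GCC): signbit only reads one bit, so no branch is ever needed, while a classification test such as isinf can be written branchless or with branches. A minimal sketch, assuming IEEE 754 binary64 doubles with the same byte order as 64-bit integers; the names sketch_signbit and sketch_isinf are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: 1 if the sign bit of x is set (this includes -0.0
   and negative NaNs), otherwise 0.  Assumes IEEE 754 binary64.  */
static inline int
sketch_signbit (double x)
{
  uint64_t bits;
  assert (sizeof bits == sizeof x);
  memcpy (&bits, &x, sizeof bits);   /* well-defined way to read the bits */
  return (int) (bits >> 63);
}

/* Hypothetical helper: nonzero if x is +Inf or -Inf.  Branchless variant:
   mask off the sign bit and compare with the bit pattern of +Inf.  */
static inline int
sketch_isinf (double x)
{
  uint64_t bits;
  assert (sizeof bits == sizeof x);
  memcpy (&bits, &x, sizeof bits);
  return (bits & UINT64_C (0x7fffffffffffffff))
         == UINT64_C (0x7ff0000000000000);
}
```

Whether a branchy isinf beats this form depends on the input distribution and the target, which is the point under dispute above.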
* Re: Old compiler optimizations in installed headers
From: Joseph Myers @ 2015-05-28 16:43 UTC (permalink / raw)
To: Ondřej Bílka; +Cc: libc-alpha
On Mon, 25 May 2015, Ondřej Bílka wrote:
> > I'd like to propose that:
> >
> > (a) if an optimization could clearly be done in the compiler - if it only
> > depends on the standard semantics of standard functions, or fully-defined
> > semantics of glibc functions that it would be reasonable to encode into
> > GCC, rather than e.g. generating calls to a glibc-internal function - then
> > we should be wary of adding it to glibc's headers in the first place: in
> > accordance with principles of GNU projects working together, it's better
> > to add the optimization to GCC; and
> >
> I disagree, for the simple reason of cost. It's considerably easier to
> write an inline function that does a transformation than to write the
> equivalent GCC pass.
The first question should be to determine the right way to implement
something rather than the quick way. And I think the right way is
generally compiler optimization - which (for example) allows for use in
kernel space, for optimization based on the function semantics (e.g. as
regards aliasing) rather than just the semantics of a particular
implementation in a header, and for different expansions depending on
whether the compiler thinks the code in question is hot or cold
(information possibly obtained from profile feedback - much code is
generally cold, so expansions that increase code size should only be used
in those bits of code determined to be hot, which is information simply
not available at all in the headers).
Then, if putting an optimization (or compiler bug workaround, etc.) in
glibc's headers when a compiler approach would also be possible, it should
always be accompanied by a comment pointing to the GCC bug report
requesting the optimization, and the bug should have a comment pointing
back to that glibc header comment and saying to inform the glibc
developers when resolving the bug so they know to insert appropriate
__GNUC_PREREQ conditionals in the header.
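
A minimal sketch of what such a guarded header fragment might look like. Everything specific here is an assumption for illustration: the bug number "NNNNN" is a placeholder, the fixed version (6.0) is made up, and sketch_streq is a hypothetical function. SKETCH_GNUC_PREREQ mirrors the version test glibc's __GNUC_PREREQ performs so that the fragment stands alone.

```c
#include <assert.h>
#include <string.h>

/* Stand-alone equivalent of glibc's __GNUC_PREREQ (maj, min).  */
#if defined __GNUC__ && defined __GNUC_MINOR__
# define SKETCH_GNUC_PREREQ(maj, min) \
   ((__GNUC__ << 16) + __GNUC_MINOR__ >= ((maj) << 16) + (min))
#else
# define SKETCH_GNUC_PREREQ(maj, min) 0
#endif

#if !SKETCH_GNUC_PREREQ (6, 0)
/* Header-side optimization for compilers that lack it.  See GCC bug NNNNN
   (placeholder), which requests the same transformation in the compiler;
   that bug should point back here so this block can be retired.  */
static inline int
sketch_streq (const char *a, const char *b)
{
  assert (a != NULL && b != NULL);
  return a[0] == b[0] && strcmp (a, b) == 0;
}
#else
/* Compiler assumed new enough to optimize the plain call itself.  */
static inline int
sketch_streq (const char *a, const char *b)
{
  assert (a != NULL && b != NULL);
  return strcmp (a, b) == 0;
}
#endif
```

Both variants return the same results; only the code the compiler is asked to generate differs, so raising the baseline later means deleting the first branch.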
--
Joseph S. Myers
joseph@codesourcery.com
* Re: Old compiler optimizations in installed headers
From: Ondřej Bílka @ 2015-05-29 11:03 UTC (permalink / raw)
To: Joseph Myers; +Cc: libc-alpha
On Thu, May 28, 2015 at 04:18:35PM +0000, Joseph Myers wrote:
> On Mon, 25 May 2015, Ondřej Bílka wrote:
>
> > > I'd like to propose that:
> > >
> > > (a) if an optimization could clearly be done in the compiler - if it only
> > > depends on the standard semantics of standard functions, or fully-defined
> > > semantics of glibc functions that it would be reasonable to encode into
> > > GCC, rather than e.g. generating calls to a glibc-internal function - then
> > > we should be wary of adding it to glibc's headers in the first place: in
> > > accordance with principles of GNU projects working together, it's better
> > > to add the optimization to GCC; and
> > >
> > I disagree, for the simple reason of cost. It's considerably easier to
> > write an inline function that does a transformation than to write the
> > equivalent GCC pass.
>
> The first question should be to determine the right way to implement
> something rather than the quick way.
For several simple optimizations both approaches are correct, so the
question is which is easier to implement. Current GCC is evidence that
compiler optimization isn't the easier one. You could have performance
regressions with almost all functions due to underlying bugs in memcmp and
memcpy generation.
As these are outstanding bugs, let's be practical here. Joseph, how long do
you think it will take to fix them? As they have been present for five
years, I wouldn't be surprised to wait another five.
So I would set a deadline of three months: if GCC cannot produce a patch
within that time, going the compiler optimization way takes too long.
> And I think the right way is
> generally compiler optimization - which (for example) allows for use in
> kernel space, for optimization based on the function semantics (e.g. as
For kernel space you could just surround these with #ifdef _GCC_USE_SSE
or equivalent.
> regards aliasing) rather than just the semantics of a particular
> implementation in a header, and for different expansions depending on
> whether the compiler thinks the code in question is hot or cold
> (information possibly obtained from profile feedback - much code is
> generally cold, so expansions that increase code size should only be used
> in those bits of code determined to be hot, which is information simply
> not available at all in the headers).
>
You shouldn't use the pattern: Something needs to be done. X is something.
So X should be done.
As you mentioned profiling: you would need userspace-based profiling
instead of the generic one done by GCC. You could access userspace
counters from a header.
As you mentioned hotness/coldness: first you need to make GCC measure the
real thing, which is the number of icache misses, instead of trying to
guess hotness/coldness from frequencies. Code in a tight loop with a high
iteration count is hot no matter how rarely it's executed.
Then, when you say that an expansion that increases code size should only
be used in hot bits of code, you are wrong again.
You also need to profile the library function, and skip the
transformation only when the function is cache-resident. If it's not,
the expansion would improve performance, as you need to fetch only several
bytes into the icache instead of wasting time fetching the whole function
into the cache.
Then you have the problem that optimizing for size is a misnomer. You
optimize knowing that each byte of code carries some penalty in cache
misses. So for each optimization you need to check whether it is
cost-effective, instead of making a broad generalization about hot/cold
code and whether it increases size or not.
Then, as I mentioned userspace profiling, you need to collect the correct
data to get the optimization. Keeping the focus on strcmp/memcmp, the
question is whether inlining the first-byte check helps.
#define strcmp(x, y) ({ \
profile.iterations++; \
if (x[0] - y[0]) \
(x[0] - y[0]); \
else \
{ \
profile.second_byte++; \
...
From my profiling this helps for strcmp and strncmp. However, it's a
mistake for memcmp, which is used to compare structures, so the mismatch
likely occurs much later. That also means that you discard information by
converting strcmp->memcmp unless you do profiling.
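
A self-contained sketch of the first-byte check with profiling counters, written as an inline function rather than a macro so it compiles as plain C. This is only one way the elided macro above could be completed; the names sketch_profile, sketch_prof and sketch_strcmp are hypothetical, and the tail simply falls back to the library strcmp.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical userspace counters, in the spirit of the macro above.  */
struct sketch_profile
{
  unsigned long calls;            /* every comparison */
  unsigned long past_first_byte;  /* comparisons not decided by byte 0 */
};
static struct sketch_profile sketch_prof;

static inline int
sketch_strcmp (const char *x, const char *y)
{
  unsigned char cx, cy;

  assert (x != NULL && y != NULL);
  cx = (unsigned char) x[0];
  cy = (unsigned char) y[0];
  sketch_prof.calls++;
  /* Inline fast path: a first-byte mismatch decides the result.  */
  if (cx != cy)
    return cx - cy;
  /* Slow path: count it, then let the library do the rest.  */
  sketch_prof.past_first_byte++;
  return strcmp (x, y);
}
```

The ratio past_first_byte / calls is the number that would say whether the inline check pays off for a given workload.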
> Then, if putting an optimization (or compiler bug workaround, etc.) in
> glibc's headers when a compiler approach would also be possible, it should
> always be accompanied by a comment pointing to the GCC bug report
> requesting the optimization, and the bug should have a comment pointing
> back to that glibc header comment and saying to inform the glibc
> developers when resolving the bug so they know to insert appropriate
> __GNUC_PREREQ conditionals in the header.
>
Of course I will describe the bugs.
Also, you have another problem with the gcc/header issue. I was asked if
it's possible to use functions with a partially expanded header, so you
would call a new symbol like memcmp_aligned to save the cost of checking
the header again. I don't think that adding an expansion in GCC is sane.
* Re: Old compiler optimizations in installed headers
From: Joseph Myers @ 2015-05-29 12:00 UTC (permalink / raw)
To: Ondřej Bílka; +Cc: libc-alpha
On Fri, 29 May 2015, Ondřej Bílka wrote:
> For several simple optimizations both approaches are correct, so the
> question is which is easier to implement. Current GCC is evidence that
> compiler optimization isn't the easier one. You could have performance
> regressions with almost all functions due to underlying bugs in memcmp and
> memcpy generation.
In most cases you haven't given enough information (such as test cases,
compiler options and confirmation this is using current GCC 6) to convince
me that there are actually problems (an awful lot of people complaining
about compiler performance are using inappropriate options or compiler
configurations).
> As these are outstanding bugs, let's be practical here. Joseph, how long do
> you think it will take to fix them? As they have been present for five
> years, I wouldn't be surprised to wait another five.
>
> So I would set a deadline of three months: if GCC cannot produce a patch
> within that time, going the compiler optimization way takes too long.
I don't think you can meaningfully count time without starting with a
constructive proposal in the right place. That means bugs in GCC Bugzilla
for each issue, constructively engaging with the points of view other
people present in response to those bugs, a meta-bug depending on all such
bugs, and starting a discussion on the GCC mailing list pointing to that
bug and offering your advice and expertise on e.g. how common different
sorts of string function inputs are in different workloads. The message
in such discussions should be one of seeking collaborators to improve
string function optimization in the GNU system as a whole, not "X sucks"
or "only this particular form of benchmarking is valid" or "changing Y is
easier than changing Z".
> Also, you have another problem with the gcc/header issue. I was asked if
> it's possible to use functions with a partially expanded header, so you
> would call a new symbol like memcmp_aligned to save the cost of checking
> the header again. I don't think that adding an expansion in GCC is sane.
It seems perfectly reasonable for glibc to define *reserved-namespace*
ABIs for such cases that GCC can then generate calls to (noting that GCC
often has compile-time knowledge of alignment). Indeed, the ARM EABI
defines functions such as __aeabi_memcpy4 and __aeabi_memclr8 for aligned
inputs - right now, those functions in glibc are wrappers for the generic
ones, so slower than the generic ones, and GCC doesn't generate calls to
them. Changing just one without the other wouldn't be that useful - but
as a collaboration, making the functions faster in glibc *and* making GCC
generate calls using the alignment information it has, given new enough
glibc on the target, would make sense.
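
A rough sketch of the idea, done by hand in a macro rather than in the compiler: pick an aligned entry point when the alignment is known at compile time. The names sketch_memcpy_aligned4 and SKETCH_COPY are hypothetical, this is not the __aeabi_memcpy4 implementation, and __alignof__ is a GNU extension. As with the wrappers mentioned above, the aligned function here just forwards to memcpy after checking its precondition.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical entry point for buffers known to be 4-byte aligned.  A
   real one would skip the alignment handling a generic memcpy needs;
   this one only verifies the precondition and forwards.  */
static void *
sketch_memcpy_aligned4 (void *dest, const void *src, size_t n)
{
  assert ((uintptr_t) dest % 4 == 0 && (uintptr_t) src % 4 == 0);
  return memcpy (dest, src, n);
}

/* Choose the entry point from the static alignment of the pointed-to
   types.  The condition is a compile-time constant, so only one call
   survives optimization.  */
#define SKETCH_COPY(dst, src, n)                                  \
  ((__alignof__ (*(dst)) >= 4 && __alignof__ (*(src)) >= 4)       \
   ? sketch_memcpy_aligned4 ((dst), (src), (n))                   \
   : memcpy ((dst), (src), (n)))
```

A compiler can do better than this macro because it also tracks alignment that the types alone do not reveal.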
Again, the principle is to work out collaboratively what's best for the
GNU system as a whole (respecting that different people use the system in
different ways and what's useful for one person may be problematic for
others, so there is a need to compromise with other requirements than just
the greatest optimization of particular programs), and to work as needed
with experts on each part of the system to achieve those goals. Not to
focus only on the part of the system you're most familiar with.
--
Joseph S. Myers
joseph@codesourcery.com
Thread overview: 11+ messages
2015-05-22 17:06 Old compiler optimizations in installed headers Joseph Myers
2015-05-22 21:36 ` Paul Eggert
2015-05-22 22:10 ` Joseph Myers
2015-05-22 22:19 ` Roland McGrath
2015-05-24 17:10 ` Ondřej Bílka
2015-05-27 20:04 ` Richard Henderson
2015-05-27 20:13 ` Ondřej Bílka
2015-05-25 2:37 ` Ondřej Bílka
2015-05-28 16:43 ` Joseph Myers
2015-05-29 11:03 ` Ondřej Bílka
2015-05-29 12:00 ` Joseph Myers