public inbox for gcc-bugs@sourceware.org
* [Bug tree-optimization/94617] New: Simple if condition not optimized
@ 2020-04-16 11:28 soap at gentoo dot org
  2020-04-16 11:57 ` [Bug tree-optimization/94617] " rguenth at gcc dot gnu.org
                   ` (10 more replies)
  0 siblings, 11 replies; 12+ messages in thread
From: soap at gentoo dot org @ 2020-04-16 11:28 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94617

            Bug ID: 94617
           Summary: Simple if condition not optimized
           Product: gcc
           Version: 10.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: soap at gentoo dot org
  Target Milestone: ---

Given the following C++ snippet

  const char* vanilla_bandpass(int a, int b, int x,
                               const char* low, const char* high)
  {
      const bool within_interval { (a <= x) && (x < b) };
      return (within_interval ? high : low);
  }

With -O3 -march=znver2, GCC trunk yields the following assembly

  vanilla_bandpass(int, int, int, char const*, char const*):
          mov     rax, r8
          cmp     edi, edx
          jg      .L4
          cmp     edx, esi
          jge     .L4
          ret
  .L4:
          mov     rax, rcx
          ret

which is terrible. On the other hand, Clang emits

  vanilla_bandpass(int, int, int, char const*, char const*):
          cmp     edx, esi
          cmovge  r8, rcx
          cmp     edi, edx
          cmovg   r8, rcx
          mov     rax, r8
          ret

which is a lot better. There is also a branchless version, though I'm not
100% certain it's free of UB:

  #include <cstdint>

  const char* funky_bandpass(int a, int b, int x,
                             const char* low, const char* high)
  {
      const bool within_interval { (a <= x) && (x < b) };
      const auto low_ptr  = reinterpret_cast<uintptr_t>(low) * (!within_interval);
      const auto high_ptr = reinterpret_cast<uintptr_t>(high) * within_interval;

      const auto ptr_sum = low_ptr + high_ptr;
      const auto* result = reinterpret_cast<const char*>(ptr_sum);
      return result;
  }

which yields

  funky_bandpass(int, int, int, char const*, char const*):
          cmp     edi, edx
          setle   al
          cmp     edx, esi
          setl    dl
          and     eax, edx
          mov     edx, eax
          xor     edx, 1
          movzx   edx, dl
          movzx   eax, al
          imul    rcx, rdx
          imul    rax, r8
          add     rax, rcx
          ret

which is jump-free and in practice executes at the same observable rate as
Clang's assembly, but still looks needlessly complex. Clang manages to compile
this code to the same assembly as vanilla_bandpass.

Any chance of getting the optimizer ironed out for this?


* [Bug tree-optimization/94617] Simple if condition not optimized
  2020-04-16 11:28 [Bug tree-optimization/94617] New: Simple if condition not optimized soap at gentoo dot org
@ 2020-04-16 11:57 ` rguenth at gcc dot gnu.org
  2020-04-16 12:02 ` soap at gentoo dot org
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: rguenth at gcc dot gnu.org @ 2020-04-16 11:57 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94617

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
Why do you think this is terrible?  Aggressive use of conditional moves is not
a good idea in general.


* [Bug tree-optimization/94617] Simple if condition not optimized
  2020-04-16 11:28 [Bug tree-optimization/94617] New: Simple if condition not optimized soap at gentoo dot org
  2020-04-16 11:57 ` [Bug tree-optimization/94617] " rguenth at gcc dot gnu.org
@ 2020-04-16 12:02 ` soap at gentoo dot org
  2020-04-16 12:10 ` rguenth at gcc dot gnu.org
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: soap at gentoo dot org @ 2020-04-16 12:02 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94617

--- Comment #2 from David Seifert <soap at gentoo dot org> ---
(In reply to Richard Biener from comment #1)
> Why do you think this is terrible?  Aggressive use of conditional moves is
> not a good idea in general.

I've benchmarked the code, and on a Zen 2 the GCC-produced code is 1.8x slower
than the Clang code. My gut feeling is that the jump instruction is killing
performance. I haven't tried it on an older CPU, but I can give it a try there
too.
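
A minimal harness in that spirit (the interval, iteration count, and RNG below
are illustrative placeholders, not the exact benchmark that was run) looks
like:

#include <cstdint>
#include <cstdio>
#include <random>

// Same body as in the bug report; noinline so the call itself is measured
// rather than being folded into the loop.
__attribute__((noinline))
static const char* vanilla_bandpass(int a, int b, int x, const char* low,
                                    const char* high)
{
    const bool within_interval { (a <= x) && (x < b) };
    return (within_interval ? high : low);
}

int main()
{
    static const char lo_tag[] = "low";
    static const char hi_tag[] = "high";

    std::mt19937 gen{42};
    std::uniform_int_distribution<int> dist{0, 99};

    std::uintptr_t sink = 0;
    for (int i = 0; i < 100000000; ++i) {
        const int x = dist(gen);  // unpredictable: ~50% of values land in [25, 75)
        sink += reinterpret_cast<std::uintptr_t>(
            vanilla_bandpass(25, 75, x, lo_tag, hi_tag));
    }
    std::printf("%llu\n", static_cast<unsigned long long>(sink));
    return 0;
}

With random x the branch is essentially unpredictable, which is the case
where a jump hurts most compared to a conditional move.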


* [Bug tree-optimization/94617] Simple if condition not optimized
  2020-04-16 11:28 [Bug tree-optimization/94617] New: Simple if condition not optimized soap at gentoo dot org
  2020-04-16 11:57 ` [Bug tree-optimization/94617] " rguenth at gcc dot gnu.org
  2020-04-16 12:02 ` soap at gentoo dot org
@ 2020-04-16 12:10 ` rguenth at gcc dot gnu.org
  2020-04-16 12:17 ` rguenth at gcc dot gnu.org
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: rguenth at gcc dot gnu.org @ 2020-04-16 12:10 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94617

--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
Note the RTL if-conversion pass doesn't recognize what we present to it.
If you alter initial RTL expansion via -fno-tree-ter (not recommended in
general), we produce a more 1:1 translation of the code:

_Z16vanilla_bandpassiiiPKcS0_:
.LFB0:
        .cfi_startproc
        movl    %esi, %r9d
        cmpl    %edx, %edi
        setle   %sil
        cmpl    %r9d, %edx
        setl    %al
        testb   %al, %sil
        movq    %rcx, %rax
        cmovne  %r8, %rax
        ret

which I wouldn't expect to execute faster though.


* [Bug tree-optimization/94617] Simple if condition not optimized
  2020-04-16 11:28 [Bug tree-optimization/94617] New: Simple if condition not optimized soap at gentoo dot org
                   ` (2 preceding siblings ...)
  2020-04-16 12:10 ` rguenth at gcc dot gnu.org
@ 2020-04-16 12:17 ` rguenth at gcc dot gnu.org
  2020-04-16 12:18 ` soap at gentoo dot org
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: rguenth at gcc dot gnu.org @ 2020-04-16 12:17 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94617

--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
IIRC you can also express the range test this way:

const char* vanilla_bandpass(int a, int b, int x, const char* low,
                             const char* high)
{
  const bool within_interval { (unsigned long)x - a < (unsigned long)b - a };
  return (within_interval ? high : low);
}

which gives you

_Z16vanilla_bandpassiiiPKcS0_:
.LFB0:
        .cfi_startproc
        movslq  %edi, %rdi
        movslq  %edx, %rdx
        movslq  %esi, %rsi
        movq    %rcx, %rax
        subq    %rdi, %rdx
        subq    %rdi, %rsi
        cmpq    %rsi, %rdx
        cmovb   %r8, %rax
        ret


* [Bug tree-optimization/94617] Simple if condition not optimized
  2020-04-16 11:28 [Bug tree-optimization/94617] New: Simple if condition not optimized soap at gentoo dot org
                   ` (3 preceding siblings ...)
  2020-04-16 12:17 ` rguenth at gcc dot gnu.org
@ 2020-04-16 12:18 ` soap at gentoo dot org
  2020-04-16 12:47 ` jakub at gcc dot gnu.org
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: soap at gentoo dot org @ 2020-04-16 12:18 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94617

--- Comment #5 from David Seifert <soap at gentoo dot org> ---
(In reply to Richard Biener from comment #3)
> Note the RTL if-conversion pass doesn't recognize what we present to it.
> If you alter initial RTL expansion via -fno-tree-ter (not recommended in
> general)
> we produce a more 1:1 translation of the code as
> 
> _Z16vanilla_bandpassiiiPKcS0_:
> .LFB0:
>         .cfi_startproc
>         movl    %esi, %r9d
>         cmpl    %edx, %edi
>         setle   %sil
>         cmpl    %r9d, %edx
>         setl    %al
>         testb   %al, %sil
>         movq    %rcx, %rax
>         cmovne  %r8, %rax
>         ret
> 
> which I wouldn't expect to execute faster though.

I just benchmarked the code on an oldish Ivy Bridge.  GCC with
vanilla_bandpass is 2.1x slower than GCC with funky_bandpass, and GCC with
funky_bandpass is 12% slower than Clang with vanilla_bandpass.


* [Bug tree-optimization/94617] Simple if condition not optimized
  2020-04-16 11:28 [Bug tree-optimization/94617] New: Simple if condition not optimized soap at gentoo dot org
                   ` (4 preceding siblings ...)
  2020-04-16 12:18 ` soap at gentoo dot org
@ 2020-04-16 12:47 ` jakub at gcc dot gnu.org
  2020-04-16 13:11 ` rguenth at gcc dot gnu.org
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: jakub at gcc dot gnu.org @ 2020-04-16 12:47 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94617

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jakub at gcc dot gnu.org

--- Comment #6 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
(In reply to David Seifert from comment #5)
> just benchmarked the code on an oldish Ivybridge. GCC with vanilla_bandpass
> is 2.1x slower than GCC with funky_bandpass, and GCC with funky_bandpass is
> 12% slower than Clang with vanilla_bandpass.

It really depends.  Conditional moves are implemented rather poorly on x86:
sometimes they improve speed, and at other times they slow things down by
huge amounts.  It mostly depends on how well the CPU can predict the branches,
if there are any (if prediction works well, branches are significantly faster;
if they are very badly predictable, conditional moves are faster).

As for turning it into
  (unsigned)x - a < (unsigned)b - a
reassoc has optimize_range_tests_var_bound, which right now handles
   a >= 0 && a < b   into  (unsigned) a < (unsigned) b
   a >= 0 && a <= b  into  (unsigned) a <= (unsigned) b
Now
   a >= b && a < c   into  (unsigned) a - b < (unsigned) c - b
isn't equivalent to that, e.g. if b is 5 and c is 4: then the original test is
always false, while (unsigned) c - b wraps around to a huge value and the
transformed test is true for almost every a.
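
For illustration, a tiny standalone check of that counterexample (hypothetical
test code, not from the GCC testsuite):

#include <cassert>

int main()
{
    const int b = 5, c = 4;
    for (int a = -10; a <= 10; ++a) {
        const bool original    = a >= b && a < c;   // empty range: always false
        const bool transformed = (unsigned) a - b < (unsigned) c - b;
        // (unsigned) c - b wraps to UINT_MAX, so the transformed test is true
        // for every a except a == 4, where the left side is also UINT_MAX.
        if (a != 4)
            assert(original != transformed);
    }
    return 0;
}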


* [Bug tree-optimization/94617] Simple if condition not optimized
  2020-04-16 11:28 [Bug tree-optimization/94617] New: Simple if condition not optimized soap at gentoo dot org
                   ` (5 preceding siblings ...)
  2020-04-16 12:47 ` jakub at gcc dot gnu.org
@ 2020-04-16 13:11 ` rguenth at gcc dot gnu.org
  2021-07-19  3:40 ` [Bug target/94617] " pinskia at gcc dot gnu.org
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: rguenth at gcc dot gnu.org @ 2020-04-16 13:11 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94617

--- Comment #7 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Jakub Jelinek from comment #6)
> (In reply to David Seifert from comment #5)
> > just benchmarked the code on an oldish Ivybridge. GCC with vanilla_bandpass
> > is 2.1x slower than GCC with funky_bandpass, and GCC with funky_bandpass is
> > 12% slower than Clang with vanilla_bandpass.
> 
> It really depends, conditional moves are really badly implemented in x86 and
> sometimes they do improve speed and at other times they slow things down by
> huge amounts, which mostly depends on how well the CPU can predict branches
> if there are (if well, then branches are significantly faster, if they are
> very badly predictable, then conditional moves are faster).
> 
> As for turning it into
> (unsigned)x - a < (unsigned)b - a
> reassoc has
> optimize_range_tests_var_bound
> which right now handles
>    a >= 0 && a < b into (unsigned) a < (unsigned) b
>    a >= 0 && a <= b into (unsigned) a <= (unsigned) b
> Now
>    a >= b && a < c into (unsigned) a - b < (unsigned) c - b
> isn't equivalent to that, e.g. if b is 5 and c 4.

Sure, there are extra constraints: a <= b must hold, and the unsigned type
has to have one more bit of precision.  So if we wanted to do this transform,
we'd need symbolic range info constraining a and b appropriately.
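
To make those constraints concrete, a small standalone check (illustrative
only, not GCC code) that enforces a <= b and uses a wider unsigned type:

#include <cassert>

static bool range_original(int a, int b, int x)
{
    return a <= x && x < b;
}

static bool range_transformed(int a, int b, int x)
{
    // Relies on a <= b and on the unsigned type being wider than int: when
    // x < a, the wrapped value of x - a is always larger than any valid b - a.
    return (unsigned long long) x - a < (unsigned long long) b - a;
}

int main()
{
    for (int a = -4; a <= 4; ++a)
        for (int b = a; b <= 4; ++b)          // only pairs with a <= b
            for (int x = -6; x <= 6; ++x)
                assert(range_original(a, b, x) == range_transformed(a, b, x));
    return 0;
}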


* [Bug target/94617] Simple if condition not optimized
  2020-04-16 11:28 [Bug tree-optimization/94617] New: Simple if condition not optimized soap at gentoo dot org
                   ` (6 preceding siblings ...)
  2020-04-16 13:11 ` rguenth at gcc dot gnu.org
@ 2021-07-19  3:40 ` pinskia at gcc dot gnu.org
  2022-11-26 20:43 ` pinskia at gcc dot gnu.org
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: pinskia at gcc dot gnu.org @ 2021-07-19  3:40 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94617

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
          Component|tree-optimization           |target
             Target|                            |x86_64-linux-gnu

--- Comment #8 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Note that only vanilla_bandpass produces the best code on aarch64:
        cmp     w0, w2
        ccmp    w2, w1, 0, le
        csel    x0, x3, x4, ge


* [Bug target/94617] Simple if condition not optimized
  2020-04-16 11:28 [Bug tree-optimization/94617] New: Simple if condition not optimized soap at gentoo dot org
                   ` (7 preceding siblings ...)
  2021-07-19  3:40 ` [Bug target/94617] " pinskia at gcc dot gnu.org
@ 2022-11-26 20:43 ` pinskia at gcc dot gnu.org
  2023-06-25 22:29 ` pinskia at gcc dot gnu.org
  2023-06-25 22:45 ` pinskia at gcc dot gnu.org
  10 siblings, 0 replies; 12+ messages in thread
From: pinskia at gcc dot gnu.org @ 2022-11-26 20:43 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94617

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|                            |2022-11-26
     Ever confirmed|0                           |1
             Status|UNCONFIRMED                 |NEW

--- Comment #9 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
I think this is a side effect of what I mentioned in PR 107642:
"Note this is not exactly true as there is code which does that again in
dojump.cc:"


* [Bug target/94617] Simple if condition not optimized
  2020-04-16 11:28 [Bug tree-optimization/94617] New: Simple if condition not optimized soap at gentoo dot org
                   ` (8 preceding siblings ...)
  2022-11-26 20:43 ` pinskia at gcc dot gnu.org
@ 2023-06-25 22:29 ` pinskia at gcc dot gnu.org
  2023-06-25 22:45 ` pinskia at gcc dot gnu.org
  10 siblings, 0 replies; 12+ messages in thread
From: pinskia at gcc dot gnu.org @ 2023-06-25 22:29 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94617

--- Comment #10 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
For the first function (vanilla_bandpass), GCC 12+ now produces:
        movq    %rcx, %rax
        cmpl    %edx, %edi
        jg      .L2
        cmpl    %esi, %edx
        cmovl   %r8, %rax
.L2:
        ret


For the second function (funky_bandpass) the problem is:
  _3 = ~within_interval_11;
  # RANGE [irange] long unsigned int [0, 1] NONZERO 0x1
  _4 = (long unsigned intD.16) _3;
...
  low_ptr_13 = _4 * low.0_5;
  # RANGE [irange] long unsigned int [0, 1] NONZERO 0x1
  _6 = (long unsigned intD.16) within_interval_11;
...
  high_ptr_15 = _6 * high.1_7;
  ptr_sum_16 = low_ptr_13 + high_ptr_15;

We should recognize that ptr_sum_16 is really just
within_interval_11 ? high.1_7 : low.0_5.

Match pattern for that:
```
(for op (plus bit_ior bit_xor)
 (simplify
  (op:c (mult:c (convert @0) @1)
        (mult:c (convert (bit_not @0)) @2))
  (if (INTEGRAL_TYPE_P (TREE_TYPE (@0))
       && TYPE_PRECISION (TREE_TYPE (@0)) == 1
       && TYPE_UNSIGNED (TREE_TYPE (@0)))
   (cond @0 @1 @2))))
```
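
As a sanity check of the identity that pattern is meant to capture (a
standalone source-level analogue, not GCC code):
```
#include <cassert>
#include <cstdint>

// Source-level analogue of the fold: for a bool w,
//   ptr(low) * !w + ptr(high) * w  ==  (w ? high : low)
int main()
{
    static const char lo[] = "low";
    static const char hi[] = "high";
    for (int i = 0; i <= 1; ++i) {
        const bool within = i;
        const std::uintptr_t sum =
            reinterpret_cast<std::uintptr_t>(lo) * !within +
            reinterpret_cast<std::uintptr_t>(hi) * within;
        assert(reinterpret_cast<const char*>(sum) == (within ? hi : lo));
    }
    return 0;
}
```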


* [Bug target/94617] Simple if condition not optimized
  2020-04-16 11:28 [Bug tree-optimization/94617] New: Simple if condition not optimized soap at gentoo dot org
                   ` (9 preceding siblings ...)
  2023-06-25 22:29 ` pinskia at gcc dot gnu.org
@ 2023-06-25 22:45 ` pinskia at gcc dot gnu.org
  10 siblings, 0 replies; 12+ messages in thread
From: pinskia at gcc dot gnu.org @ 2023-06-25 22:45 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94617

--- Comment #11 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Note that for funky_bandpass, GCC 13+ no longer produces an imul but rather
uses neg/and instead:
        movl    %edx, %eax
        cmpl    %edx, %edi
        setle   %dl
        cmpl    %esi, %eax
        setl    %al
        andl    %eax, %edx
        cmpb    $1, %dl
        movzbl  %dl, %edx
        sbbq    %rax, %rax
        negq    %rdx
        andq    %rax, %rcx
        andq    %rdx, %r8
        leaq    (%rcx,%r8), %rax

