public inbox for gcc-bugs@sourceware.org
* [Bug other/113682] New: Branches in branchless binary search rather than cmov/csel/csinc
@ 2024-01-31 13:48 redbeard0531 at gmail dot com
  2024-01-31 14:16 ` [Bug rtl-optimization/113682] " rguenth at gcc dot gnu.org
                   ` (8 more replies)
  0 siblings, 9 replies; 10+ messages in thread
From: redbeard0531 at gmail dot com @ 2024-01-31 13:48 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113682

            Bug ID: 113682
           Summary: Branches in branchless binary search rather than
                    cmov/csel/csinc
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: other
          Assignee: unassigned at gcc dot gnu.org
          Reporter: redbeard0531 at gmail dot com
  Target Milestone: ---

I've been trying to eliminate unpredictable branches in a hot function where
perf counters show a high mispredict percentage. Unfortunately, I haven't been
able to find an incantation to get gcc to generate branchless code other than
inline asm, which I'd rather avoid. In this case I've even laid out the
critical lines so that they exactly match the behavior of the csinc and csel
instructions on arm64, but they are still not used.


Somewhat minimized repro:

typedef unsigned long size_t;
struct ITEM {char* addr; size_t len;};
int cmp(ITEM* user, ITEM* tree);

size_t bsearch2(ITEM* user, ITEM** tree, size_t treeSize) {
    auto pos = tree;
    size_t low = 0;
    size_t high = treeSize;
    while (low < high) {
        size_t i = (low + high) / 2;
        int res = cmp(user, tree[i]);

        // These should be cmp + csinc + csel on arm
        // and lea + test + cmov + cmov on x86.
        low = res > 0 ? i + 1 : low; // csinc
        high = res < 0 ? i: high; // csel

        if (res == 0) return i;
    }
    return -1;
}


On arm64 that generates a conditional branch on res > 0:
        bl      cmp(ITEM*, ITEM*)
        cmp     w0, 0
        bgt     .L15 // does low = i + 1 then loops
        mov     x20, x19
        bne     .L4 // loop


On x86_64 it does something similar:
        call    cmp(ITEM*, ITEM*)
        test    eax, eax
        jg      .L16 
        jne     .L17


Note that clang produces the desired codegen for both:

arm:
        bl      cmp(ITEM*, ITEM*)
        cmp     w0, #0
        csinc   x23, x23, x22, le
        csel    x19, x22, x19, lt
        cbnz    w0, .LBB0_1

x86:
        call    cmp(ITEM*, ITEM*)@PLT
        lea     rcx, [r12 + 1]
        test    eax, eax
        cmovg   r13, rcx
        cmovs   rbx, r12
        jne     .LBB0_1

(Full output for all four is available at https://www.godbolt.org/z/aWrKbYPTG.
The snippets are from trunk, but this also reproduces on 13.2.)

Ideally GCC would generate branchless output for the supplied code, but if
there is some (reasonable) incantation that coaxes it into producing
branchless output, I'd be happy with that too.
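For reference, one mask-based incantation sometimes used to force branchless updates, written here as a sketch in the shape of the repro (not code from this report; the comparator is a hypothetical stand-in for the opaque extern cmp, and a compiler is still free to turn this back into branches):

```cpp
#include <cstddef>

struct ITEM { char* addr; std::size_t len; };

// Hypothetical stand-in comparator for illustration only; the report
// uses an opaque extern cmp.  Orders ITEMs by len.
static int cmp(ITEM* user, ITEM* tree) {
    if (user->len < tree->len) return -1;
    return user->len > tree->len ? 1 : 0;
}

// Same search, but the two conditional updates are computed with
// all-ones/all-zeros masks instead of ternaries, which compilers
// typically (though not guaranteedly) lower without a branch on res.
std::size_t bsearch2_masked(ITEM* user, ITEM** tree, std::size_t treeSize) {
    std::size_t low = 0;
    std::size_t high = treeSize;
    while (low < high) {
        std::size_t i = (low + high) / 2;
        int res = cmp(user, tree[i]);
        std::size_t gt = 0 - static_cast<std::size_t>(res > 0); // ~0 iff res > 0
        std::size_t lt = 0 - static_cast<std::size_t>(res < 0); // ~0 iff res < 0
        low  = (gt & (i + 1)) | (~gt & low);  // low  = res > 0 ? i + 1 : low
        high = (lt & i)       | (~lt & high); // high = res < 0 ? i : high
        if (res == 0) return i;
    }
    return static_cast<std::size_t>(-1);
}
```

Whether this actually beats a branch of course still depends on the mispredict rate.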

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Bug rtl-optimization/113682] Branches in branchless binary search rather than cmov/csel/csinc
  2024-01-31 13:48 [Bug other/113682] New: Branches in branchless binary search rather than cmov/csel/csinc redbeard0531 at gmail dot com
@ 2024-01-31 14:16 ` rguenth at gcc dot gnu.org
  2024-01-31 16:53 ` pinskia at gcc dot gnu.org
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: rguenth at gcc dot gnu.org @ 2024-01-31 14:16 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113682

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization
          Component|other                       |rtl-optimization
            Version|unknown                     |14.0
             Target|                            |aarch64, x86_64-*-*

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
Since there's a loop exit involved (and the loop has multiple exits)
if-conversion is made difficult here.

You could try rotating the loop manually, producing a do { } while loop with
a "nicer" exit condition, and see whether that helps.
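A sketch of that rotation (hypothetical code, not from the thread; the comparator is a stand-in for the report's opaque extern cmp):

```cpp
#include <cstddef>

struct ITEM { char* addr; std::size_t len; };

// Stand-in comparator for illustration; the report declares cmp extern.
static int cmp(ITEM* user, ITEM* tree) {
    if (user->len < tree->len) return -1;
    return user->len > tree->len ? 1 : 0;
}

// Manually rotated loop: the emptiness check is hoisted out, leaving a
// do { } while body whose back edge tests low < high.  Note the early
// return for res == 0 still gives the loop a second exit.
std::size_t bsearch2_rotated(ITEM* user, ITEM** tree, std::size_t treeSize) {
    std::size_t low = 0;
    std::size_t high = treeSize;
    if (low >= high) return static_cast<std::size_t>(-1);
    do {
        std::size_t i = (low + high) / 2;
        int res = cmp(user, tree[i]);
        low  = res > 0 ? i + 1 : low;  // intended csinc
        high = res < 0 ? i : high;     // intended csel
        if (res == 0) return i;
    } while (low < high);
    return static_cast<std::size_t>(-1);
}
```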


* [Bug rtl-optimization/113682] Branches in branchless binary search rather than cmov/csel/csinc
  2024-01-31 13:48 [Bug other/113682] New: Branches in branchless binary search rather than cmov/csel/csinc redbeard0531 at gmail dot com
  2024-01-31 14:16 ` [Bug rtl-optimization/113682] " rguenth at gcc dot gnu.org
@ 2024-01-31 16:53 ` pinskia at gcc dot gnu.org
  2024-01-31 18:00 ` [Bug middle-end/113682] " pinskia at gcc dot gnu.org
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: pinskia at gcc dot gnu.org @ 2024-01-31 16:53 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113682

--- Comment #2 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
I should note that on x86, two cmovs in a row might be an issue and worse
than branches.  There is a cost model, and the x86 backend rejects that.

There are some cores where it is worse; I don't know whether that applies to
recent ones though.


* [Bug middle-end/113682] Branches in branchless binary search rather than cmov/csel/csinc
  2024-01-31 13:48 [Bug other/113682] New: Branches in branchless binary search rather than cmov/csel/csinc redbeard0531 at gmail dot com
  2024-01-31 14:16 ` [Bug rtl-optimization/113682] " rguenth at gcc dot gnu.org
  2024-01-31 16:53 ` pinskia at gcc dot gnu.org
@ 2024-01-31 18:00 ` pinskia at gcc dot gnu.org
  2024-02-01 10:30 ` tnfchris at gcc dot gnu.org
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: pinskia at gcc dot gnu.org @ 2024-01-31 18:00 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113682

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
          Component|rtl-optimization            |middle-end

--- Comment #3 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
From the looks of it, jump threading is getting in the way of conditional move
generation.


* [Bug middle-end/113682] Branches in branchless binary search rather than cmov/csel/csinc
  2024-01-31 13:48 [Bug other/113682] New: Branches in branchless binary search rather than cmov/csel/csinc redbeard0531 at gmail dot com
                   ` (2 preceding siblings ...)
  2024-01-31 18:00 ` [Bug middle-end/113682] " pinskia at gcc dot gnu.org
@ 2024-02-01 10:30 ` tnfchris at gcc dot gnu.org
  2024-02-01 13:52 ` redbeard0531 at gmail dot com
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: tnfchris at gcc dot gnu.org @ 2024-02-01 10:30 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113682

Tamar Christina <tnfchris at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |tnfchris at gcc dot gnu.org

--- Comment #4 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
Note that on aarch64 the situation isn't that simple; whether the csel or the
branch is better inside the loop depends on a number of factors.

We did extensive benchmarking of this on various uArches, and there are a
number of cases where the csel inside the loop ends up being significantly
worse; the branch misprediction penalty also depends heavily on the uArch.

In the end the conclusion was that, to have a chance of doing this properly,
we would have to form conditional operations at the GIMPLE level rather than
in RTL as we do today.


* [Bug middle-end/113682] Branches in branchless binary search rather than cmov/csel/csinc
  2024-01-31 13:48 [Bug other/113682] New: Branches in branchless binary search rather than cmov/csel/csinc redbeard0531 at gmail dot com
                   ` (3 preceding siblings ...)
  2024-02-01 10:30 ` tnfchris at gcc dot gnu.org
@ 2024-02-01 13:52 ` redbeard0531 at gmail dot com
  2024-02-02  1:04 ` pinskia at gcc dot gnu.org
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: redbeard0531 at gmail dot com @ 2024-02-01 13:52 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113682

--- Comment #5 from Mathias Stearn <redbeard0531 at gmail dot com> ---
(In reply to Andrew Pinski from comment #2)
> I should note that on x86, 2 cmov in a row might be an issue and worse than
> branches. There is a cost model and the x86 backend rejects that.
> 
> There are some cores where it is worse. I don't know if it applies to recent
> ones though.

Do you know if that applies to any cores that support x86_64? I checked Agner
Fog's tables, and only very old cores (P4 era) had high reciprocal
throughput, and even then it was lower than the latency. It looks like all
AMD cores and Intel cores newer than Ivy Bridge (i.e. everything from the
last 10 years) can execute multiple CMOVs per cycle (reciprocal throughput
< 1). From what I can see, bad CMOV was a particular problem of the Pentium 4
and Prescott cores, and possibly the PPro, though I don't see numbers for it.
I don't think any of those cores should have an impact on the default cost
model in 2024.


* [Bug middle-end/113682] Branches in branchless binary search rather than cmov/csel/csinc
  2024-01-31 13:48 [Bug other/113682] New: Branches in branchless binary search rather than cmov/csel/csinc redbeard0531 at gmail dot com
                   ` (4 preceding siblings ...)
  2024-02-01 13:52 ` redbeard0531 at gmail dot com
@ 2024-02-02  1:04 ` pinskia at gcc dot gnu.org
  2024-04-02 17:46 ` [Bug rtl-optimization/113682] " tnfchris at gcc dot gnu.org
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: pinskia at gcc dot gnu.org @ 2024-02-02  1:04 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113682

--- Comment #6 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
(In reply to Mathias Stearn from comment #5)
> Do you know if that applies to any cores that support x86_64? I checked
> Agner Fog's tables, and only very very old cores (P4 era) had high
> reciprocal throughput, but even then it was less than latency. It looks like
> all AMD cores and intel cores newer than ivy bridge (ie everything from the
> last 10 years) are able to execute multiple CMOVs per cycle (reciprocal
> throughput < 1). From what I can see, it looks like bad CMOV was a
> particular problem of the Pentium 4 and Prescott cores, and possibly PPro,
> but I don't see the numbers for it. I don't think any of those cores should
> have an impact on the default cost model in 2024.

/* X86_TUNE_ONE_IF_CONV_INSNS: Restrict a number of cmov insns in
   if-converted sequence to one.  */
DEF_TUNE (X86_TUNE_ONE_IF_CONV_INSN, "one_if_conv_insn",
          m_SILVERMONT | m_KNL | m_KNM | m_INTEL | m_CORE_ALL | m_GOLDMONT
          | m_GOLDMONT_PLUS | m_TREMONT | m_CORE_HYBRID | m_CORE_ATOM
          | m_ZHAOXIN | m_GENERIC)


So it looks like it is the low-power Atom cores which still have this issue.
Tremont is from 2021, so you can't say those cores should not impact the
default cost model ...
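For experimentation only, a sketch of how that tunable could be switched off from the command line (-mtune-ctrl is a real but internal x86 developer option, not meant for production use, and the file name here is hypothetical):

```shell
# Hypothetical invocation: lift the one-cmov-per-if-conversion
# restriction and inspect the generated assembly.
gcc -O2 -mtune-ctrl=^one_if_conv_insn -S bsearch.c -o -
```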


* [Bug rtl-optimization/113682] Branches in branchless binary search rather than cmov/csel/csinc
  2024-01-31 13:48 [Bug other/113682] New: Branches in branchless binary search rather than cmov/csel/csinc redbeard0531 at gmail dot com
                   ` (5 preceding siblings ...)
  2024-02-02  1:04 ` pinskia at gcc dot gnu.org
@ 2024-04-02 17:46 ` tnfchris at gcc dot gnu.org
  2024-04-02 18:01 ` pinskia at gcc dot gnu.org
  2024-04-03 11:10 ` tnfchris at gcc dot gnu.org
  8 siblings, 0 replies; 10+ messages in thread
From: tnfchris at gcc dot gnu.org @ 2024-04-02 17:46 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113682

Tamar Christina <tnfchris at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
     Ever confirmed|0                           |1
          Component|middle-end                  |rtl-optimization
   Last reconfirmed|                            |2024-04-02

--- Comment #7 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
Ok, so for now I think we should set aside the issue of whether/when we
should perform the if-conversion, and instead look at why we miss these
conversions:

The first problem is that GCC does not support if-converting blocks which
terminate in another branch.  In other words, the destination of one of the
arms must have a single successor per noce_find_if_block, which requires the
final block to be a JOIN block.

In

size_t bsearch2(ITEM* user, ITEM** tree, size_t treeSize) {
    auto pos = tree;
    size_t low = 0;
    size_t high = treeSize;
    while (low < high) {
        size_t i = (low + high) / 2;
        int res = cmp(user, tree[i]);

        // These should be cmp + csinc + csel on arm
        // and lea + test + cmov + cmov on x86.
        low = res > 0 ? i + 1 : low; // csinc
        high = res < 0 ? i: high; // csel

        if (res == 0) return i;
    }
    return -1;
}

this fails because of the:

        if (res == 0) return i;

which means that the final block doesn't adhere to this, so the cbranch
blocks the if-conversion.

if we simplify the testcase:

typedef unsigned long size_t;
extern int cmp(size_t *);

size_t bsearch2(size_t high, size_t low, size_t i) {
    while (low < high) {
        int res = cmp(&i);

        low = res > 0 ? i + 1 : low;
        high = res < 0 ? i: high;
    }
    return -1;
}

We now detect the csel but not the csinc.  This has two reasons: we visit the
BBs in entry->exit order, but we should probably do exit->entry, so that when
we if-convert blocks and delete the branches, the preceding blocks become
valid to if-convert.

But the main reason the csinc is missed is code sinking: the update of the
reference &i is sunk into the res > 0 branch.

  <bb 3> [local count: 955630224]:
  # high_18 = PHI <high_14(7), high_7(D)(2)>
  # low_19 = PHI <low_5(7), low_8(D)(2)>
  res_11 = cmp (&i);
  if (res_11 > 0)
    goto <bb 4>; [59.00%]
  else
    goto <bb 5>; [41.00%]

  <bb 4> [local count: 563821836]:
  i.1_1 = i;
  iftmp.0_12 = i.1_1 + 1;
  goto <bb 7>; [100.00%]

This is because the compiler (rightfully) decides that the dereference should
only happen in the branch that needs it.  Because of this, block 4 is no
longer if-convertible.  However, since the load is from the stack, we are
technically allowed to move it around.  But doing so is only beneficial if
the branch is really unpredictable.

To show that it is the sinking, we can avoid it by using -fdisable-tree-sink2
and this example:

size_t bsearch2(size_t high, size_t low, size_t *iv, int *resv) {
    while (low < high) {
        int res = cmp (iv);
        size_t i = *iv;
        low = res > 0 ? i + 1 : low;
    }
    return -1;
}

which generates the csinc.

However the following example again doesn't generate the csinc for GCC but does
for LLVM:

size_t bsearch3(size_t high, size_t low, size_t i, int res) {
    while (low < high) {
        asm volatile ("" : "=w"(i));
        asm volatile ("" : "=w"(res));
        low = res > 0 ? i + 1 : low;
    }
    return -1;
}

which is because of the loop header copying (ch) pass.  GCC optimizes this
scalar code from:

size_t bsearch3 (size_t high, size_t low, size_t i, int res)
{
  size_t iftmp.0_7;

  <bb 2> [local count: 59055799]:
  goto <bb 5>; [100.00%]

  <bb 3> [local count: 1014686024]:
  __asm__ __volatile__("" : "=w" i_5);
  __asm__ __volatile__("" : "=w" res_6);
  if (res_6 > 0)
    goto <bb 4>; [5.50%]
  else
    goto <bb 6>; [94.50%]

  <bb 4> [local count: 55807731]:
  iftmp.0_7 = i_5 + 1;

  <bb 5> [local count: 114863530]:
  # low_1 = PHI <low_2(D)(2), iftmp.0_7(4)>

  <bb 6> [local count: 1073741824]:
  if (low_1 < high_3(D))
    goto <bb 3>; [94.50%]
  else
    goto <bb 7>; [5.50%]

  <bb 7> [local count: 59055800]:
  return 18446744073709551615;

}

into:

size_t bsearch3 (size_t high, size_t low, size_t i, int res)
{
  size_t iftmp.0_7;

  <bb 2> [local count: 59055799]:
  if (low_2(D) < high_3(D))
    goto <bb 5>; [48.59%]
  else
    goto <bb 6>; [51.41%]

  <bb 3> [local count: 958878294]:
  __asm__ __volatile__("" : "=w" i_5);
  __asm__ __volatile__("" : "=w" res_6);
  if (res_6 > 0)
    goto <bb 4>; [5.50%]
  else
    goto <bb 3>; [94.50%]

  <bb 4> [local count: 55807731]:
  # i_10 = PHI <i_5(3), i_8(5)>
  iftmp.0_7 = i_10 + 1;
  if (high_3(D) > iftmp.0_7)
    goto <bb 5>; [48.59%]
  else
    goto <bb 6>; [51.41%]

  <bb 5> [local count: 55807730]:
  __asm__ __volatile__("" : "=w" i_8);
  __asm__ __volatile__("" : "=w" res_9);
  if (res_9 > 0)
    goto <bb 4>; [5.50%]
  else
    goto <bb 3>; [94.50%]

  <bb 6> [local count: 59055800]:
  return 18446744073709551615;

}

which then hits the limitation I mentioned before (the arm's destination must
have a single successor).

Using -fdisable-tree-ch2 we get:

        add     x2, x2, 1
        csel    x1, x2, x1, gt

which shows that, oddly, we still miss the csinc, meaning noce_try_addcc
doesn't trigger for some reason.

Ultimately, RTL seems like the wrong place to fix these issues, as it's much
easier to do in GIMPLE where we can more freely modify the CFG.  Though ce1
and ce2 are still in CFG-layout mode, so we could...

But I also think GIMPLE is better as it allows us to check the critical path
and the loop-carried dependencies cheaply by walking the use-def chains.

Perhaps we can use something like gimple-isel to either do the conversion, or
use another pass to reshape the GIMPLE so CE can find them?

Additionally, it would be good to have a way for users to tell us that a
branch is unpredictable, so that the heuristics can be relaxed.  LLVM has
__builtin_unpredictable to help it along.

Perhaps we should do the same? Though there is already
__builtin_expect_with_probability, and I suppose a probability of ~50% should
be interpreted as unpredictable?
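For illustration, a sketch of what that could look like for the update in this PR, using the existing __builtin_expect_with_probability (a real GCC builtin since GCC 9; the helper name here is made up, and whether the if-conversion cost model would actually honor the hint is exactly the open question):

```c
#include <stddef.h>

/* Hypothetical helper: the 0.5 probability says "this branch is a coin
   flip", which the optimizer could take as license to if-convert. */
size_t pick_low(size_t low, size_t i, int res)
{
    if (__builtin_expect_with_probability(res > 0, 1, 0.5))
        return i + 1;
    return low;
}
```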

thoughts?

Note that these are only issues for loops; outside of loops we detect them
reliably because the transformations above don't happen:

unsigned long long
test_csinc64_ifcvt(unsigned long long *x0,
                   unsigned long long x1,
                   unsigned long long x2) {
  if (*x0 == x1)
    ++ x2;

  return x2;
}

does the right thing.


* [Bug rtl-optimization/113682] Branches in branchless binary search rather than cmov/csel/csinc
  2024-01-31 13:48 [Bug other/113682] New: Branches in branchless binary search rather than cmov/csel/csinc redbeard0531 at gmail dot com
                   ` (6 preceding siblings ...)
  2024-04-02 17:46 ` [Bug rtl-optimization/113682] " tnfchris at gcc dot gnu.org
@ 2024-04-02 18:01 ` pinskia at gcc dot gnu.org
  2024-04-03 11:10 ` tnfchris at gcc dot gnu.org
  8 siblings, 0 replies; 10+ messages in thread
From: pinskia at gcc dot gnu.org @ 2024-04-02 18:01 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113682

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           See Also|                            |https://gcc.gnu.org/bugzill
                   |                            |a/show_bug.cgi?id=112402

--- Comment #8 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
This might also be the path splitting pass, which runs at the GIMPLE level,
causing issues; see PR 112402.


* [Bug rtl-optimization/113682] Branches in branchless binary search rather than cmov/csel/csinc
  2024-01-31 13:48 [Bug other/113682] New: Branches in branchless binary search rather than cmov/csel/csinc redbeard0531 at gmail dot com
                   ` (7 preceding siblings ...)
  2024-04-02 18:01 ` pinskia at gcc dot gnu.org
@ 2024-04-03 11:10 ` tnfchris at gcc dot gnu.org
  8 siblings, 0 replies; 10+ messages in thread
From: tnfchris at gcc dot gnu.org @ 2024-04-03 11:10 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113682

--- Comment #9 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
(In reply to Andrew Pinski from comment #8)
> This might be the path splitting running on the gimple level causing issues
> too; see PR 112402 .

Ah, that's a good shout.  It looks like Richi already agrees that we should
recognize/do some if-conversion at the GIMPLE level.

Guess that just leaves the where.

