[Bug tree-optimization/104950] New: GCC does not emit branchless code

public inbox for gcc-bugs@sourceware.org

* [Bug tree-optimization/104950] New: GCC does not emit branchless code
From: vincenzo.innocente at cern dot ch @ 2022-03-16  8:51 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104950

            Bug ID: 104950
           Summary: GCC does not emit branchless code
           Product: gcc
           Version: 12.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: vincenzo.innocente at cern dot ch
  Target Milestone: ---

In this example GCC fails to emit branchless code, while Clang does.
In the actual application, measurements show a slowdown of up to a factor of 2.
I managed to force branchless code (-DBL), but the result is pretty unfriendly.
Godbolt link (GCC, Clang, GCC -DBL):

https://godbolt.org/z/KWY1rjhhY



and here inlined

#include <vector>
const float defaultBaseResponse = 0.5;
class DForest {
public:
    //based on FastForest::evaluate() and BDTree::parseTree()
    DForest() {
    }
    float evaluate(const float* features) const;

    std::vector<int> rootIndices_;
    //"node" layout: cut, index, left, right
    struct Node{
        float v; int i,l,r;
        constexpr int eval(float const * f) const {
#ifdef BL 
          auto m = f[i] > v;
          return *((&l) + int(m));
#else
          return f[i] > v ? r : l;
#endif
        }
    };
    std::vector<Node> nodes_;
    std::vector<float> responses_;
    std::vector<float> baseResponses_;
};

float DForest::evaluate(const float* features) const{
    float sum{defaultBaseResponse + baseResponses_[0]};
    for(int index : rootIndices_){
        do {
            index = nodes_[index].eval(features);
        } while (index>0);
        sum += responses_[-index];
    }
    return sum;
}
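
A minimal sketch of a friendlier form of the -DBL workaround, assuming the node layout may be changed so that both children sit in a two-element array. The names NodeArr, child and walkTree are hypothetical and do not appear in the report, and whether a given GCC version actually emits branchless code for this form is not verified here.

```cpp
#include <cassert>

// Hypothetical variant of the report's Node: the two children live in an
// array (child[0] = left, child[1] = right), so the comparison result can be
// used as an index without stepping a pointer from one member to the next.
struct NodeArr {
    float v;       // cut value
    int i;         // index of the feature compared against the cut
    int child[2];  // assumed layout, not taken from the original report

    constexpr int eval(const float* f) const {
        return child[f[i] > v ? 1 : 0];
    }
};

// Follows one tree from `root` until a non-positive index marks a leaf,
// mirroring the do/while loop in DForest::evaluate().
inline int walkTree(const NodeArr* nodes, int root, const float* features) {
    assert(nodes != nullptr && features != nullptr);
    int index = root;
    do {
        index = nodes[index].eval(features);
    } while (index > 0);
    return index;
}
```

Indexing an array member avoids the pointer arithmetic on &l used in the -DBL branch, which relies on l and r being adjacent in memory.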


* [Bug rtl-optimization/104950] GCC does not emit branchless code for load next to each other
From: pinskia at gcc dot gnu.org @ 2022-03-16  9:08 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104950

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2022-03-16
            Summary|GCC does not emit           |GCC does not emit
                   |branchless code             |branchless code for load
                   |                            |next to each other
     Ever confirmed|0                           |1
           Severity|normal                      |enhancement

--- Comment #1 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Confirmed, reduced testcase:
struct Node{
    int i, l,r;
};
int eval(int t, struct Node *a)  {
    return t == 0 ? a->r : a->l;
}

Note on aarch64, if we remove the i field, then ifcvt on the RTL level is able
to catch it but it does not do it for x86_64 (maybe a cost issue).


* [Bug rtl-optimization/104950] GCC does not emit branchless code for load next to each other
From: rguenth at gcc dot gnu.org @ 2022-03-16  9:18 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104950

--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Andrew Pinski from comment #1)
> Confirmed, reduced testcase:
> struct Node{
>     int i, l,r;
> };
> int eval(int t, struct Node *a)  {
>     return t == 0 ? a->r : a->l;
> }
> 
> Note on aarch64, if we remove the i field, then ifcvt on the RTL level is
> able to catch it but it does not do it for x86_64 (maybe a cost issue).

I think it's out of fear that one of a->r / a->l traps while the other does
not.  With

struct Node{
    int i, l,r;
};
int eval(int t, struct Node *a)  {
    int r = a->r;
    int l = a->l;
    return t == 0 ? r : l;
}

and using -fno-tree-sink we get branchless code on x86_64.

I think we can use TBAA to argue that *a should be valid to dereference here.
What Andrew observes after removing 'i' is likely due to struct Node then
having large enough alignment guarantees(?), or to a bug in the aarch64 backend.
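
To illustrate the trapping concern in its general form, here is a small sketch (the names pick and pickHoisted are hypothetical, not from this thread). In the conditional form only one pointer is ever dereferenced, so the caller may pass an invalid pointer for the untaken side; loading both values up front is only equivalent when both are known to be dereferenceable.

```cpp
#include <cassert>

// Only the selected pointer is dereferenced, so a caller may legally pass a
// null (or otherwise invalid) pointer for the side that is not taken.
inline int pick(bool takeFirst, const int* first, const int* second) {
    return takeFirst ? *first : *second;
}

// Hoisting both loads is only equivalent when both pointers are known to be
// dereferenceable; the assert states that precondition explicitly.
inline int pickHoisted(bool takeFirst, const int* first, const int* second) {
    assert(first != nullptr && second != nullptr);
    int a = *first;
    int b = *second;
    return takeFirst ? a : b;
}
```

A call such as pick(true, &x, nullptr) is well defined, while the hoisted form would fault on the same arguments. For a->r and a->l the two loads share one base object, which is presumably why an argument that *a is dereferenceable would be enough.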


* [Bug rtl-optimization/104950] GCC does not emit branchless code for load next to each other
From: rguenth at gcc dot gnu.org @ 2022-03-16  9:21 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104950

--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
Ah, on aarch64 we get

        cmp     w0, 0
        add     x0, x1, 4
        csel    x0, x0, x1, eq
        ldr     w0, [x0]

so we do not load from the possibly trapping mem.  With the testcase I provided
and -fno-tree-sink on x86_64 we get

        movl    4(%rsi), %eax
        testl   %edi, %edi
        cmove   8(%rsi), %eax

which would be prone to the issue.  I suppose ifcvt is oddly restricted for the
first case.
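
The two code shapes can be written out at source level as a rough sketch (the function names are hypothetical; this only restates the two assembly sequences in C++ and makes no claim about what any particular GCC version emits for it). The first form selects an address and then performs a single load, like csel followed by ldr; the second loads one value unconditionally and conditionally replaces it with the other, like movl followed by cmove.

```cpp
#include <cassert>

struct Node {
    int i, l, r;
};

// Select the address first, then do exactly one load. Only the chosen field
// is ever read, so no access happens that the original code would not make.
inline int evalSelectAddress(int t, const Node* a) {
    assert(a != nullptr);
    const int* p = (t == 0) ? &a->r : &a->l;
    return *p;
}

// Load both fields, then select between the loaded values. This reads a
// field the original conditional expression might never touch.
inline int evalLoadBoth(int t, const Node* a) {
    assert(a != nullptr);
    int r = a->r;
    int l = a->l;
    return t == 0 ? r : l;
}
```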


* [Bug rtl-optimization/104950] GCC does not emit branchless code for load next to each other
From: crazylht at gmail dot com @ 2022-03-16  9:27 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104950

Hongtao.liu <crazylht at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |crazylht at gmail dot com

--- Comment #4 from Hongtao.liu <crazylht at gmail dot com> ---
(In reply to Richard Biener from comment #3)
> Ah, on aarch64 we get
> 
>         cmp     w0, 0
>         add     x0, x1, 4
>         csel    x0, x0, x1, eq
>         ldr     w0, [x0]
> 
> so we do not load from the possibly trapping mem.  With the testcase I
> provided and -fno-tree-sink on x86_64 we get

Not for this one

float
foo (float a, float b, float *c, int i, int j)
{
    return a > b ? c[i] : c[j];
}

gcc
        vcomiss xmm0, xmm1
        jbe     .L6
        movsx   rsi, esi
        vmovss  xmm0, DWORD PTR [rdi+rsi*4]
        ret
.L6:
        movsx   rdx, edx
        vmovss  xmm0, DWORD PTR [rdi+rdx*4]
        ret
llvm
        vucomiss        xmm0, xmm1
        cmovbe  esi, edx
        movsxd  rax, esi
        vmovss  xmm0, dword ptr [rdi + 4*rax]
        ret
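
For comparison, the shape of the llvm output can be expressed at source level as selecting the index first and loading once. This is only a sketch: the name fooSelectIndex is hypothetical, and it is not verified here that GCC turns this form into a cmov.

```cpp
#include <cassert>

// Source-level analogue of the llvm sequence above: pick the index with a
// conditional select, then perform a single load from c.
inline float fooSelectIndex(float a, float b, const float* c, int i, int j) {
    assert(c != nullptr);
    int k = (a > b) ? i : j;  // select the index
    return c[k];              // exactly one load, always from the chosen slot
}
```

Because only the chosen element is read, this form makes the same memory access as the original conditional expression.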


* [Bug rtl-optimization/104950] GCC does not emit branchless code for load next to each other
From: pinskia at gcc dot gnu.org @ 2022-03-16  9:31 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104950

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           See Also|                            |https://gcc.gnu.org/bugzill
                   |                            |a/show_bug.cgi?id=102008

--- Comment #5 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
(In reply to Hongtao.liu from comment #4)
> (In reply to Richard Biener from comment #3)
> > Ah, on aarch64 we get
> > 
> >         cmp     w0, 0
> >         add     x0, x1, 4
> >         csel    x0, x0, x1, eq
> >         ldr     w0, [x0]
> > 
> > so we do not load from the possibly trapping mem.  With the testcase I
> > provided and -fno-tree-sink on x86_64 we get
> 
> Not for this one
> 
> float
> foo (float a, float b, float *c, int i, int j)
> {
>     return a > b ? c[i] : c[j];
> }

That one is recorded as PR


* [Bug rtl-optimization/104950] GCC does not emit branchless code for load next to each other
From: crazylht at gmail dot com @ 2022-03-16  9:58 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104950

--- Comment #6 from Hongtao.liu <crazylht at gmail dot com> ---
(In reply to Andrew Pinski from comment #5)
> (In reply to Hongtao.liu from comment #4)
> > (In reply to Richard Biener from comment #3)
> > > Ah, on aarch64 we get
> > > 
> > >         cmp     w0, 0
> > >         add     x0, x1, 4
> > >         csel    x0, x0, x1, eq
> > >         ldr     w0, [x0]
> > > 
> > > so we do not load from the possibly trapping mem.  With the testcase I
> > > provided and -fno-tree-sink on x86_64 we get
> > 
> > Not for this one
> > 
> > float
> > foo (float a, float b, float *c, int i, int j)
> > {
> >     return a > b ? c[i] : c[j];
> > }
> 
> That one is recorded as PR

Just note -fno-tree-sink works for PR102008, but not for this case.

