public inbox for gcc-bugs@sourceware.org
* [Bug tree-optimization/99434] New: std::bit_cast generates more instructions than __builtin_bit_cast and memcpy with -march=native
@ 2021-03-06 20:12 unlvsur at live dot com
  2021-03-06 21:25 ` [Bug target/99434] " pinskia at gcc dot gnu.org
                   ` (5 more replies)
  0 siblings, 6 replies; 7+ messages in thread
From: unlvsur at live dot com @ 2021-03-06 20:12 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99434

            Bug ID: 99434
           Summary: std::bit_cast generates more instructions than
                    __builtin_bit_cast and memcpy with -march=native
           Product: gcc
           Version: 11.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: unlvsur at live dot com
  Target Milestone: ---

https://godbolt.org/z/5KWM8Y
#include <bit>
#include <cstdint>

struct u64x2_t
{
#if __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
    std::uint64_t high,low;
#else
    std::uint64_t low,high;
#endif
};
u64x2_t umul5(std::uint64_t a,std::uint64_t b) noexcept
{
    return std::bit_cast<u64x2_t>(static_cast<__uint128_t>(a)*b);
}

u64x2_t umul_builtin(std::uint64_t a,std::uint64_t b) noexcept
{
    return __builtin_bit_cast(u64x2_t,static_cast<__uint128_t>(a)*b);
}

assembly:
umul5(unsigned long, unsigned long):
        movq    %rdi, %rdx
        mulx    %rsi, %rdx, %rcx
        movq    %rdx, %rax
        movq    %rcx, %rdx
        ret
umul_builtin(unsigned long, unsigned long):
        movq    %rdi, %rdx
        mulx    %rsi, %rax, %rdx
        ret

There is another issue:

std::uint64_t umul128(std::uint64_t a,std::uint64_t b,std::uint64_t& high) noexcept
{
    __uint128_t res{static_cast<__uint128_t>(a)*b};
    high=static_cast<std::uint64_t>(res>>64);
    return static_cast<std::uint64_t>(res);
}
I cannot write it this way, since it generates more instructions than using
memcpy to pun the types.

Clang does not have this issue; all of these cases are handled correctly.
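For reference, the memcpy-based punning the report compares against presumably
looks something like the sketch below (my reconstruction, not code from the
report; u64x2_t and its little-endian field order are taken from the snippet
above, and umul_memcpy is a name I made up):

```cpp
#include <cstdint>
#include <cstring>

struct u64x2_t
{
    std::uint64_t low, high;  // little-endian field order, as in the report
};

// Reconstruction of the memcpy type-punning variant the report alludes to.
// memcpy from a __uint128_t (a GCC/Clang extension) into a struct of the
// same size is well-defined, and compilers typically elide the copy.
u64x2_t umul_memcpy(std::uint64_t a, std::uint64_t b) noexcept
{
    __uint128_t prod = static_cast<__uint128_t>(a) * b;
    u64x2_t out;
    std::memcpy(&out, &prod, sizeof(out));
    return out;
}
```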

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [Bug target/99434] std::bit_cast generates more instructions than __builtin_bit_cast and memcpy with -march=native
  2021-03-06 20:12 [Bug tree-optimization/99434] New: std::bit_cast generates more instructions than __builtin_bit_cast and memcpy with -march=native unlvsur at live dot com
@ 2021-03-06 21:25 ` pinskia at gcc dot gnu.org
  2021-03-06 21:36 ` unlvsur at live dot com
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: pinskia at gcc dot gnu.org @ 2021-03-06 21:25 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99434

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|normal                      |enhancement
             Target|                            |x86_64
           Keywords|                            |missed-optimization, ra
          Component|tree-optimization           |target

--- Comment #1 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
This is just a register allocation issue dealing with mulx and TImode.

If mulq was used instead (that is without -march=native), all of the functions
are done correctly.


* [Bug target/99434] std::bit_cast generates more instructions than __builtin_bit_cast and memcpy with -march=native
  2021-03-06 20:12 [Bug tree-optimization/99434] New: std::bit_cast generates more instructions than __builtin_bit_cast and memcpy with -march=native unlvsur at live dot com
  2021-03-06 21:25 ` [Bug target/99434] " pinskia at gcc dot gnu.org
@ 2021-03-06 21:36 ` unlvsur at live dot com
  2021-03-06 22:50 ` pinskia at gcc dot gnu.org
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: unlvsur at live dot com @ 2021-03-06 21:36 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99434

--- Comment #2 from cqwrteur <unlvsur at live dot com> ---
(In reply to Andrew Pinski from comment #1)
> This is just a register allocation issue dealing with mulx and TImode.
> 
> If mulq was used instead (that is without -march=native), all of the
> functions are done correctly.

I do not think so. I think GCC generally gets patterns like this wrong. I have
even found out how to reproduce different suboptimal results deterministically.

For example, see here:
https://godbolt.org/z/PbobYG

Any time it deals with shifts like >>32 or >>64, GCC produces slower code.
This happens even without -march=native.

Meanwhile Clang generates exactly the same assembly for all variants, which
suggests the expected codegen is achievable; GCC gets these patterns wrong.

It looks like we need more optimizations on trees for these patterns.


* [Bug target/99434] std::bit_cast generates more instructions than __builtin_bit_cast and memcpy with -march=native
  2021-03-06 20:12 [Bug tree-optimization/99434] New: std::bit_cast generates more instructions than __builtin_bit_cast and memcpy with -march=native unlvsur at live dot com
  2021-03-06 21:25 ` [Bug target/99434] " pinskia at gcc dot gnu.org
  2021-03-06 21:36 ` unlvsur at live dot com
@ 2021-03-06 22:50 ` pinskia at gcc dot gnu.org
  2021-03-08 16:49 ` jakub at gcc dot gnu.org
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: pinskia at gcc dot gnu.org @ 2021-03-06 22:50 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99434

--- Comment #3 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
(In reply to cqwrteur from comment #2)
> (In reply to Andrew Pinski from comment #1)
> > This is just a register allocation issue dealing with mulx and TImode.
> > 
> > If mulq was used instead (that is without -march=native), all of the
> > functions are done correctly.
> 
> I do not think so. I think GCC generally did things like this wrong. I have
> even found out how to produce different wrong results deterministically.
> 
> For example like this
> https://godbolt.org/z/PbobYG
> 
> Any time it deals with things like >>32 or >>64, it produces a slower result.
> This even compiles without -march=native.

This is still a register allocation issue, this time dealing with DImode on
32-bit. GCC has a known issue with register allocation when dealing with values
stored in two registers.  See PR 21150, PR 43644, PR 50339, etc.
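The 32-bit analogue of the reported pattern would look something like the
sketch below (my reconstruction, not necessarily the exact code behind the
godbolt link; umul64 is a name I made up). On 32-bit x86 the 64-bit product
occupies a register pair, so splitting it into high/low halves exercises the
same two-register allocation path:

```cpp
#include <cstdint>

// 32-bit analogue of the reporter's umul128: the DImode (64-bit) product
// lives in a register pair on 32-bit x86, and extracting the halves via
// >>32 stresses the register allocator the same way TImode does on 64-bit.
std::uint32_t umul64(std::uint32_t a, std::uint32_t b,
                     std::uint32_t& high) noexcept
{
    std::uint64_t res = static_cast<std::uint64_t>(a) * b;
    high = static_cast<std::uint32_t>(res >> 32);
    return static_cast<std::uint32_t>(res);
}
```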


* [Bug target/99434] std::bit_cast generates more instructions than __builtin_bit_cast and memcpy with -march=native
  2021-03-06 20:12 [Bug tree-optimization/99434] New: std::bit_cast generates more instructions than __builtin_bit_cast and memcpy with -march=native unlvsur at live dot com
                   ` (2 preceding siblings ...)
  2021-03-06 22:50 ` pinskia at gcc dot gnu.org
@ 2021-03-08 16:49 ` jakub at gcc dot gnu.org
  2021-03-08 16:50 ` jakub at gcc dot gnu.org
  2021-03-08 20:40 ` unlvsur at live dot com
  5 siblings, 0 replies; 7+ messages in thread
From: jakub at gcc dot gnu.org @ 2021-03-08 16:49 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99434

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jakub at gcc dot gnu.org,
                   |                            |jamborm at gcc dot gnu.org,
                   |                            |vmakarov at gcc dot gnu.org
             Status|UNCONFIRMED                 |NEW
     Ever confirmed|0                           |1
   Last reconfirmed|                            |2021-03-08

--- Comment #4 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
The umul5 case in #c0 is worse because of SRA.
With -O2 -fno-tree-sra optimized dump looks like:
  _3 = a_4(D) w* b_5(D);
  D.2396 = VIEW_CONVERT_EXPR<struct u64x2_t>(_3);
  D.2383 = D.2396;
  return D.2383;
and
  _3 = a_4(D) w* b_5(D);
  D.2389 = VIEW_CONVERT_EXPR<struct u64x2_t>(_3);
  return D.2389;
for the two functions and even when there is the superfluous copying we emit
the same assembly.
But with SRA the former becomes:
  _3 = a_4(D) w* b_5(D);
  D.2396 = VIEW_CONVERT_EXPR<struct u64x2_t>(_3);
  SR.6_12 = D.2396.low;
  SR.7_13 = D.2396.high;
  D.2383.low = SR.6_12;
  D.2383.high = SR.7_13;
  return D.2383;
In the -fno-tree-sra case the IL just contains one extra TImode pseudo-to-pseudo
assignment, which is soon optimized away, so we have just:
(insn 7 4 13 2 (parallel [
            (set (reg:TI 87)
                (mult:TI (zero_extend:TI (reg:DI 89))
                    (zero_extend:TI (reg:DI 90))))
            (clobber (reg:CC 17 flags))
        ]) "pr99434.C":23:66 426 {*umulditi3_1}
     (expr_list:REG_DEAD (reg:DI 90)
        (expr_list:REG_DEAD (reg:DI 89)
            (expr_list:REG_UNUSED (reg:CC 17 flags)
                (nil)))))
(insn 13 7 14 2 (set (reg/i:TI 0 ax)
        (reg:TI 87)) "pr99434.C":24:1 73 {*movti_internal}
     (expr_list:REG_DEAD (reg:TI 87)
        (nil)))
(insn 14 13 0 2 (use (reg/i:TI 0 ax)) "pr99434.C":24:1 -1
     (nil))
before reload, while with SRA we have:
(insn 7 4 19 2 (parallel [
            (set (reg:TI 90)
                (mult:TI (zero_extend:TI (reg:DI 98))
                    (zero_extend:TI (reg:DI 99))))
            (clobber (reg:CC 17 flags))
        ]) "pr99434.C":18:57 426 {*umulditi3_1}
     (expr_list:REG_DEAD (reg:DI 99)
        (expr_list:REG_DEAD (reg:DI 98)
            (expr_list:REG_UNUSED (reg:CC 17 flags)
                (nil)))))
(insn 19 7 20 2 (set (reg:DI 92 [ D.2396 ])
        (subreg:DI (reg:TI 90) 0)) "pr99434.C":5:40 74 {*movdi_internal}
     (nil))
(insn 20 19 23 2 (set (reg:DI 93 [ D.2396+8 ])
        (subreg:DI (reg:TI 90) 8)) "pr99434.C":5:40 74 {*movdi_internal}
     (expr_list:REG_DEAD (reg:TI 90)
        (nil)))
(insn 23 20 24 2 (set (reg:DI 0 ax)
        (reg:DI 92 [ D.2396 ])) "pr99434.C":19:1 74 {*movdi_internal}
     (expr_list:REG_DEAD (reg:DI 92 [ D.2396 ])
        (nil)))
(insn 24 23 17 2 (set (reg:DI 1 dx [+8 ])
        (reg:DI 93 [ D.2396+8 ])) "pr99434.C":19:1 74 {*movdi_internal}
     (expr_list:REG_DEAD (reg:DI 93 [ D.2396+8 ])
        (nil)))
(insn 17 24 0 2 (use (reg/i:TI 0 ax)) "pr99434.C":19:1 -1
     (nil))
In both cases we get the same (right) RA decisions about the umulditi3_1,
and the IRA decisions seem to be good too:
      Popping a2(r90,l0)  --         assign reg 0
      Popping a4(r98,l0)  --         assign reg 5
      Popping a0(r93,l0)  --         assign reg 1
      Popping a1(r92,l0)  --         assign reg 0
      Popping a3(r99,l0)  --         assign reg 4
for some reason LRA then decides to use different registers...


* [Bug target/99434] std::bit_cast generates more instructions than __builtin_bit_cast and memcpy with -march=native
  2021-03-06 20:12 [Bug tree-optimization/99434] New: std::bit_cast generates more instructions than __builtin_bit_cast and memcpy with -march=native unlvsur at live dot com
                   ` (3 preceding siblings ...)
  2021-03-08 16:49 ` jakub at gcc dot gnu.org
@ 2021-03-08 16:50 ` jakub at gcc dot gnu.org
  2021-03-08 20:40 ` unlvsur at live dot com
  5 siblings, 0 replies; 7+ messages in thread
From: jakub at gcc dot gnu.org @ 2021-03-08 16:50 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99434

--- Comment #5 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Testcase without includes:
template<typename _To, typename _From>
constexpr _To
bit_cast(const _From& __from) noexcept
{
  return __builtin_bit_cast(_To, __from);
}

struct u64x2_t
{
#if __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
  unsigned long long high,low;
#else
  unsigned long long low,high;
#endif
};
u64x2_t umul5(unsigned long long a,unsigned long long b) noexcept
{
  return bit_cast<u64x2_t>(static_cast<__uint128_t>(a)*b);
}

u64x2_t umul_builtin(unsigned long long a,unsigned long long b) noexcept
{
  return __builtin_bit_cast(u64x2_t,static_cast<__uint128_t>(a)*b);
}


* [Bug target/99434] std::bit_cast generates more instructions than __builtin_bit_cast and memcpy with -march=native
  2021-03-06 20:12 [Bug tree-optimization/99434] New: std::bit_cast generates more instructions than __builtin_bit_cast and memcpy with -march=native unlvsur at live dot com
                   ` (4 preceding siblings ...)
  2021-03-08 16:50 ` jakub at gcc dot gnu.org
@ 2021-03-08 20:40 ` unlvsur at live dot com
  5 siblings, 0 replies; 7+ messages in thread
From: unlvsur at live dot com @ 2021-03-08 20:40 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99434

--- Comment #6 from cqwrteur <unlvsur at live dot com> ---
(In reply to Jakub Jelinek from comment #5)
> Testcase without includes:
> template<typename _To, typename _From>
> constexpr _To
> bit_cast(const _From& __from) noexcept
> {
>   return __builtin_bit_cast(_To, __from);
> }
> 
> struct u64x2_t
> {
> #if __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
>   unsigned long long high,low;
> #else
>   unsigned long long low,high;
> #endif
> };
> u64x2_t umul5(unsigned long long a,unsigned long long b) noexcept
> {
>   return bit_cast<u64x2_t>(static_cast<__uint128_t>(a)*b);
> }
> 
> u64x2_t umul_builtin(unsigned long long a,unsigned long long b) noexcept
> {
>   return __builtin_bit_cast(u64x2_t,static_cast<__uint128_t>(a)*b);
> }

Did you fix the >>64 issue here, or should I file another bug to report that?

The bit_cast/memcpy trick is meant to work around that issue. Perhaps the
memcpy pattern can be recognized earlier than std::bit_cast?

std::uint64_t umul128(std::uint64_t a,std::uint64_t b,std::uint64_t& high) noexcept
{
    __uint128_t res{static_cast<__uint128_t>(a)*b};
    high=static_cast<std::uint64_t>(res>>64);
    return static_cast<std::uint64_t>(res);
}
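For what it is worth, the intended semantics of that function can be checked
quickly; the sketch below reproduces the reporter's umul128 verbatim so it is
self-contained (__uint128_t is a GCC/Clang extension, and the check values are
mine, not from the report):

```cpp
#include <cstdint>

// The reporter's umul128, reproduced so the semantic check is self-contained:
// the full 128-bit product is split into a returned low half and an
// out-parameter high half.
std::uint64_t umul128(std::uint64_t a, std::uint64_t b,
                      std::uint64_t& high) noexcept
{
    __uint128_t res{static_cast<__uint128_t>(a) * b};
    high = static_cast<std::uint64_t>(res >> 64);
    return static_cast<std::uint64_t>(res);
}
```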

