public inbox for gcc-bugs@sourceware.org
* [Bug target/100694] New: PPC: initialization of __int128 is very inefficient
@ 2021-05-20  7:40 jens.seifert at de dot ibm.com
  2021-05-20 20:28 ` [Bug target/100694] " segher at gcc dot gnu.org
                   ` (5 more replies)
  0 siblings, 6 replies; 7+ messages in thread
From: jens.seifert at de dot ibm.com @ 2021-05-20  7:40 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100694

            Bug ID: 100694
           Summary: PPC: initialization of __int128 is very inefficient
           Product: gcc
           Version: 8.3.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: jens.seifert at de dot ibm.com
  Target Milestone: ---

Initializing an __int128 from two 64-bit integers is implemented very
inefficiently.

The most natural code, which works well on all other platforms, generates
two additional "li 0" and two additional "or" instructions.

void test2(unsigned __int128* res, unsigned long long hi,
           unsigned long long lo)
{
   unsigned __int128 i = hi;   /* zero-extend hi to 128 bits */
   i <<= 64;                   /* move it into the high half */
   i |= lo;                    /* merge in the low half */
   *res = i;
}

_Z5test2Poyy:
.LFB15:
        .cfi_startproc
        li 8,0
        li 11,0
        or 10,5,8
        or 11,11,4
        std 10,0(3)
        std 11,8(3)
        blr
        .long 0
        .byte 0,9,0,0,0,0,0,0
        .cfi_endproc


While for the above sample using "+" instead of "|" solves the issue, it
generates addc+addze in other, more complicated scenarios.
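
For reference, the "+" variant mentioned above could look like the sketch
below. The name test2_add is hypothetical and does not appear in the report;
the only intended difference from test2 is the operator used to merge the low
half. The NULL check is an addition for the sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical variant of test2: merges the low half with "+" instead
   of "|".  Because the low 64 bits are zero after the shift, the
   addition cannot carry and yields the same value as the OR.  */
void test2_add(unsigned __int128 *res, unsigned long long hi,
               unsigned long long lo)
{
    assert(res != NULL);
    unsigned __int128 i = hi;   /* zero-extend hi to 128 bits */
    i <<= 64;                   /* low 64 bits are now zero */
    i += lo;                    /* same value as i | lo here */
    *res = i;
}
```

Whether the compiler recognizes that no carry is possible depends on the
surrounding code, which is presumably why the add-with-carry pair shows up in
larger functions.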

The ugliest workaround I can think of is the one I now use.

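void test4(unsigned __int128* res, unsigned long long hi,
           unsigned long long lo)
{
   /* Overlay the 128-bit value with its two 64-bit halves.  "lo" is
      placed at offset 0, i.e. this assumes a little-endian layout.  */
   union
   {
      unsigned __int128 i;
      struct
      {
         unsigned long long lo;
         unsigned long long hi;
      } s;
   } u;
   u.s.lo = lo;
   u.s.hi = hi;
   *res = u.i;
}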

This generates the expected code sequence in all cases I have looked at.

_Z5test4Poyy:
.LFB17:
        .cfi_startproc
        std 5,0(3)
        std 4,8(3)
        blr
        .long 0
        .byte 0,9,0,0,0,0,0,0
        .cfi_endproc

Please fold each "li 0" + "or" pair into a no-op.
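
Note that the union in test4 hard-codes the order of the two halves. A
byte-order-aware form of the same idea might look like the sketch below. The
helper name make_u128 is hypothetical, and it has not been checked whether
GCC emits the same two-store sequence for it; it relies only on the
__BYTE_ORDER__ macro that GCC and Clang predefine.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not from the report: stores hi:lo into *res by
   writing the two 64-bit halves in the order the target expects.  */
void make_u128(unsigned __int128 *res, unsigned long long hi,
               unsigned long long lo)
{
    unsigned long long halves[2];

    assert(res != NULL);
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
    halves[0] = lo;   /* low half lives at offset 0 */
    halves[1] = hi;
#else
    halves[0] = hi;   /* high half lives at offset 0 on big-endian */
    halves[1] = lo;
#endif
    /* memcpy avoids reading a union member other than the one written. */
    memcpy(res, halves, sizeof halves);
}
```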

* [Bug target/100694] PPC: initialization of __int128 is very inefficient
From: segher at gcc dot gnu.org @ 2021-05-20 20:28 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100694

Segher Boessenkool <segher at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|                            |2021-05-20
     Ever confirmed|0                           |1
             Status|UNCONFIRMED                 |NEW

--- Comment #1 from Segher Boessenkool <segher at gcc dot gnu.org> ---
The important difference between powerpc64 and aarch64 is that the store
is done in TImode on powerpc64, but as two DImode stores on aarch64, already
right after expand (and before expand the code was identical).

Confirmed.

* [Bug target/100694] PPC: initialization of __int128 is very inefficient
From: roger at nextmovesoftware dot com @ 2022-07-04 16:56 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100694

Roger Sayle <roger at nextmovesoftware dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |roger at nextmovesoftware dot com

--- Comment #2 from Roger Sayle <roger at nextmovesoftware dot com> ---
On x86, I proposed tackling this type of poor code generation issue for TImode
operations by introducing a zero_extendditi2 pattern.  Currently rs6000.md
(also) doesn't provide a zero extension operation from DImode to TImode, so the
middle-end expands things using SUBREGs, which unfortunately interferes with
combine's ability to optimize things.  Improving x86_64's TImode operations
is still a work in progress, but a patch for issues similar to rs6000.md's was
posted here: https://gcc.gnu.org/pipermail/gcc-patches/2022-June/596165.html
Perhaps similar zero_extend and *concat operations would help on powerpc*?

* [Bug target/100694] PPC: initialization of __int128 is very inefficient
From: segher at gcc dot gnu.org @ 2022-07-04 17:17 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100694

--- Comment #3 from Segher Boessenkool <segher at gcc dot gnu.org> ---
Should this not be handled by the subreg passes?

* [Bug target/100694] PPC: initialization of __int128 is very inefficient
From: segher at gcc dot gnu.org @ 2022-07-06  9:25 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100694

--- Comment #4 from Segher Boessenkool <segher at gcc dot gnu.org> ---
On aarch64 we have (in expand):

;; i_4 = i_3 << 64;

(insn 10 9 11 (set (subreg:DI (reg/v:TI 94 [ i ]) 8)
        (subreg:DI (reg/v:TI 93 [ i ]) 0)) "100694.c":4:6 -1
     (nil))

(insn 11 10 0 (set (subreg:DI (reg/v:TI 94 [ i ]) 0)
        (const_int 0 [0])) "100694.c":4:6 -1
     (nil))

But on rs6000 we get:

;; i_4 = i_3 << 64;

(insn 10 9 11 (set (subreg:DI (reg/v:TI 119 [ i ]) 0)
        (ashift:DI (subreg:DI (reg/v:TI 118 [ i ]) 8)
            (const_int 0 [0]))) "100694.c":4:6 -1
     (nil))

(insn 11 10 0 (set (subreg:DI (reg/v:TI 119 [ i ]) 8)
        (const_int 0 [0])) "100694.c":4:6 -1
     (nil))

What the what.

* [Bug target/100694] PPC: initialization of __int128 is very inefficient
From: guihaoc at gcc dot gnu.org @ 2022-07-25  8:22 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100694

HaoChen Gui <guihaoc at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |guihaoc at gcc dot gnu.org

--- Comment #5 from HaoChen Gui <guihaoc at gcc dot gnu.org> ---
(In reply to Segher Boessenkool from comment #4)

On rs6000, insn 10 is optimized in the forward propagation pass.
test.c.261r.fwprop1:
(insn 10 5 11 2 (set (subreg:DI (reg/v:TI 119 [ i ]) 8)
        (reg/v:DI 122 [ hi ])) "test.c":4:6 670 {*movdi_internal64}
     (expr_list:REG_DEAD (reg:DI 126 [ i ])

It seems aarch64 optimizes it in the expand pass.

The remaining problem is that the "ior" operation is done in TImode on rs6000,
while it is done on two subreg:DI halves on aarch64.  The subreg pass can
decompose a register that is only ever accessed through subregs.  If the ior
were done on two subreg:DI halves on rs6000, the subreg pass could optimize it.

On rs6000:
(insn 14 13 15 2 (set (reg:TI 125 [ i ])
        (ior:TI (reg:TI 124 [ lo ])
            (reg/v:TI 119 [ i ]))) "test.c":5:6 494 {*boolti3_internal}

On aarch64:
(insn 21 20 22 2 (set (reg:DI 100)
        (ior:DI (subreg:DI (reg:TI 99) 0)
            (subreg:DI (reg/v:TI 94 [ i ]) 0))) "/app/example.c":5:6 521 {iordi3}
(insn 23 22 24 2 (set (reg:DI 101)
        (ior:DI (subreg:DI (reg:TI 99) 8)
            (subreg:DI (reg/v:TI 94 [ i ]) 8))) "/app/example.c":5:6 521 {iordi3}
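
The difference between one TImode ior and two DImode iors can be modelled in
plain C, as a rough sketch. The struct and function names below are
hypothetical, and this models only the arithmetic, not GCC's RTL passes. Once
the OR is split per half, each half has one operand that is known to be zero,
so each OR collapses to a plain copy.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a TImode value as two DImode halves. */
struct ti_halves { uint64_t lo, hi; };

/* Per-half OR, mirroring the two ior:DI insns in the aarch64 dump. */
static struct ti_halves ior_halves(struct ti_halves a, struct ti_halves b)
{
    struct ti_halves r = { a.lo | b.lo, a.hi | b.hi };
    return r;
}

/* Builds hi:lo the way test2 does, but half by half. */
struct ti_halves build(uint64_t hi, uint64_t lo)
{
    struct ti_halves shifted = { 0, hi };  /* i <<= 64: low half is zero */
    struct ti_halves ext     = { lo, 0 };  /* zero-extended lo: high half is zero */
    struct ti_halves r = ior_halves(shifted, ext);

    /* OR with zero is the identity, so each half is just a copy. */
    assert(r.lo == lo && r.hi == hi);
    return r;
}
```

With a single 128-bit OR the zero halves are hidden inside the wide operands,
which matches the "li 0" + "or" pairs in the powerpc64 output of the original
report.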

* [Bug target/100694] PPC: initialization of __int128 is very inefficient
From: guihaoc at gcc dot gnu.org @ 2022-07-28  9:11 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100694

--- Comment #6 from HaoChen Gui <guihaoc at gcc dot gnu.org> ---
I made a patch to convert ashift to a move when the second operand is
const0_rtx.  With the patch, the expand dump looks just like aarch64's, but the
problem is still there.
I tested the patch with SPECint.  All the object files are the same as the
baseline, so it seems the shift is always optimized away in later passes.
