public inbox for gcc-bugs@sourceware.org
* [Bug target/96933] New: inefficient code for char/short vec CTOR
@ 2020-09-04  9:31 linkw at gcc dot gnu.org
  2020-09-04  9:33 ` [Bug target/96933] rs6000: " linkw at gcc dot gnu.org
                   ` (13 more replies)
  0 siblings, 14 replies; 15+ messages in thread
From: linkw at gcc dot gnu.org @ 2020-09-04  9:31 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96933

            Bug ID: 96933
           Summary: inefficient code for char/short vec CTOR
           Product: gcc
           Version: 11.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: linkw at gcc dot gnu.org
  Target Milestone: ---

While investigating the vectorization cost for vec_construct, I happened to
find that the code generated for vector construction is inefficient when
DIRECT_MOVE support is available.

The test case looks like:

vector unsigned char test_char(unsigned char f1, unsigned char f2,
                               unsigned char f3, unsigned char f4,
                               unsigned char f5, unsigned char f6,
                               unsigned char f7, unsigned char f8,
                               unsigned char f9, unsigned char f10,
                               unsigned char f11, unsigned char f12,
                               unsigned char f13, unsigned char f14,
                               unsigned char f15, unsigned char f16) {

  vector unsigned char v = {f1, f2,  f3,  f4,  f5,  f6,  f7,  f8,
                            f9, f10, f11, f12, f13, f14, f15, f16};
  return v;
}
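
The summary also covers short. A matching short test case might look like the sketch below. It is not taken from the report: it assumes GCC's generic vector extension (vector_size) instead of the AltiVec vector keyword so that it also builds on non-PowerPC hosts, and the names v8hu, test_short and check_test_short are illustrative.

```c
#include <assert.h>

/* Hypothetical 8 x unsigned short vector type (16 bytes), declared with
   GCC's generic vector extension rather than <altivec.h>.  */
typedef unsigned short v8hu __attribute__ ((vector_size (16)));

/* Build a vector from eight scalar arguments, mirroring test_char.  */
v8hu test_short (unsigned short f1, unsigned short f2,
                 unsigned short f3, unsigned short f4,
                 unsigned short f5, unsigned short f6,
                 unsigned short f7, unsigned short f8)
{
  v8hu v = {f1, f2, f3, f4, f5, f6, f7, f8};
  return v;
}

/* Sanity check: element i of the result holds argument i + 1.  */
void check_test_short (void)
{
  v8hu v = test_short (1, 2, 3, 4, 5, 6, 7, 8);
  for (int i = 0; i < 8; i++)
    assert (v[i] == i + 1);
}
```

Compiling this with -O2 -mcpu=power9 on a PowerPC toolchain should show whether the short constructor goes through the stack or through direct moves.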

The code currently generated with -mcpu=power9:

0000000000000000 <test_char>:
   0:   e8 ff a1 fb     std     r29,-24(r1)
   4:   f0 ff c1 fb     std     r30,-16(r1)
   8:   f8 ff e1 fb     std     r31,-8(r1)
   c:   60 00 a1 8b     lbz     r29,96(r1)
  10:   68 00 c1 8b     lbz     r30,104(r1)
  14:   70 00 e1 8b     lbz     r31,112(r1)
  18:   d1 ff 81 98     stb     r4,-47(r1)
  1c:   d2 ff a1 98     stb     r5,-46(r1)
  20:   78 00 81 89     lbz     r12,120(r1)
  24:   80 00 01 88     lbz     r0,128(r1)
  28:   88 00 61 89     lbz     r11,136(r1)
  2c:   90 00 81 88     lbz     r4,144(r1)
  30:   98 00 a1 88     lbz     r5,152(r1)
  34:   d0 ff 61 98     stb     r3,-48(r1)
  38:   d3 ff c1 98     stb     r6,-45(r1)
  3c:   d4 ff e1 98     stb     r7,-44(r1)
  40:   d8 ff a1 9b     stb     r29,-40(r1)
  44:   d5 ff 01 99     stb     r8,-43(r1)
  48:   d6 ff 21 99     stb     r9,-42(r1)
  4c:   d7 ff 41 99     stb     r10,-41(r1)
  50:   d9 ff c1 9b     stb     r30,-39(r1)
  54:   da ff e1 9b     stb     r31,-38(r1)
  58:   db ff 81 99     stb     r12,-37(r1)
  5c:   dc ff 01 98     stb     r0,-36(r1)
  60:   dd ff 61 99     stb     r11,-35(r1)
  64:   de ff 81 98     stb     r4,-34(r1)
  68:   df ff a1 98     stb     r5,-33(r1)
  6c:   e8 ff a1 eb     ld      r29,-24(r1)
  70:   f0 ff c1 eb     ld      r30,-16(r1)
  74:   f8 ff e1 eb     ld      r31,-8(r1)
  78:   d9 ff 41 f4     lxv     vs34,-48(r1)
  7c:   20 00 80 4e     blr

But it can be more efficient with direct move and vector merge, such as:

   0:   67 01 43 7c     mtvsrd  vs34,r3
   4:   68 00 61 80     lwz     r3,104(r1)
   8:   60 00 61 81     lwz     r11,96(r1)
   c:   67 01 64 7c     mtvsrd  vs35,r4
  10:   70 00 81 80     lwz     r4,112(r1)
  14:   67 01 03 7d     mtvsrd  vs40,r3
  18:   78 00 61 80     lwz     r3,120(r1)
  1c:   67 01 85 7c     mtvsrd  vs36,r5
  20:   67 01 a6 7c     mtvsrd  vs37,r6
  24:   67 01 07 7c     mtvsrd  vs32,r7
  28:   67 01 28 7c     mtvsrd  vs33,r8
  2c:   67 01 24 7d     mtvsrd  vs41,r4
  30:   80 00 81 80     lwz     r4,128(r1)
  34:   0c 10 43 10     vmrghb  v2,v3,v2
  38:   67 01 63 7c     mtvsrd  vs35,r3
  3c:   88 00 61 80     lwz     r3,136(r1)
  40:   67 01 eb 7c     mtvsrd  vs39,r11
  44:   0c 20 85 10     vmrghb  v4,v5,v4
  48:   67 01 a4 7c     mtvsrd  vs37,r4
  4c:   90 00 81 80     lwz     r4,144(r1)
  50:   0c 00 01 10     vmrghb  v0,v1,v0
  54:   67 01 23 7c     mtvsrd  vs33,r3
  58:   98 00 61 80     lwz     r3,152(r1)
  5c:   67 01 c9 7c     mtvsrd  vs38,r9
  60:   0c 38 e8 10     vmrghb  v7,v8,v7
  64:   67 01 04 7d     mtvsrd  vs40,r4
  68:   0c 48 63 10     vmrghb  v3,v3,v9
  6c:   67 01 23 7d     mtvsrd  vs41,r3
  70:   0c 28 a1 10     vmrghb  v5,v1,v5
  74:   67 01 2a 7c     mtvsrd  vs33,r10
  78:   0c 40 09 11     vmrghb  v8,v9,v8
  7c:   0c 30 21 10     vmrghb  v1,v1,v6
  80:   4c 11 44 10     vmrglh  v2,v4,v2
  84:   4c 39 63 10     vmrglh  v3,v3,v7
  88:   4c 29 88 10     vmrglh  v4,v8,v5
  8c:   4c 01 a1 10     vmrglh  v5,v1,v0
  90:   8c 19 64 10     vmrglw  v3,v4,v3
  94:   8c 11 45 10     vmrglw  v2,v5,v2
  98:   57 13 43 f0     xxmrgld vs34,vs35,vs34

* [Bug target/96933] rs6000: inefficient code for char/short vec CTOR
From: linkw at gcc dot gnu.org @ 2020-09-04  9:33 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96933

Kewen Lin <linkw at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |ASSIGNED
                 CC|                            |bergner at gcc dot gnu.org,
                   |                            |linkw at gcc dot gnu.org,
                   |                            |segher at gcc dot gnu.org,
                   |                            |wschmidt at gcc dot gnu.org
            Summary|inefficient code for        |rs6000: inefficient code
                   |char/short vec CTOR         |for char/short vec CTOR
   Last reconfirmed|                            |2020-09-04
             Target|                            |powerpc
     Ever confirmed|0                           |1
           Assignee|unassigned at gcc dot gnu.org      |linkw at gcc dot gnu.org

* [Bug target/96933] rs6000: inefficient code for char/short vec CTOR
From: segher at gcc dot gnu.org @ 2020-09-04 10:26 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96933

--- Comment #1 from Segher Boessenkool <segher at gcc dot gnu.org> ---
Is that actually faster though?  The original has shorter dependency
chains.  Or is this to avoid some LHS/SHL?

* [Bug target/96933] rs6000: inefficient code for char/short vec CTOR
From: linkw at gcc dot gnu.org @ 2020-09-04 10:46 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96933

--- Comment #2 from Kewen Lin <linkw at gcc dot gnu.org> ---
(In reply to Segher Boessenkool from comment #1)
> Is that actually faster though?  The original has shorter dependency
> chains.  Or is this to avoid some LHS/SHL?

Yes, I tested it with one constructed case: the original version takes 18.20s
while the optimized version takes 8.40s. And yes, I guess it's due to LHS/SHL,
similar to the vec_insert issue xionghu is working on.

* [Bug target/96933] rs6000: inefficient code for char/short vec CTOR
From: rguenth at gcc dot gnu.org @ 2020-09-04 12:06 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96933

--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
Very likely the byte stores and then the following vector load will also
trigger STLF issues.

* [Bug target/96933] rs6000: inefficient code for char/short vec CTOR
From: segher at gcc dot gnu.org @ 2020-09-04 13:04 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96933

--- Comment #4 from Segher Boessenkool <segher at gcc dot gnu.org> ---
Yes, timing suggests there is some SHL/LHS flush.

On p9 and later we can use mtvsrdd instead of mtvsrd (moving two
bytes into place at once), which reduces the number of moves from
16 to 8, and the number of merges from 15 to 7 (and reduces path
length by 1).  This sounds like a no-brainer win with that :-)
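
The counts above follow from merging the partial vectors pairwise in a tree: n inputs need n - 1 merges and log2(n) levels. A minimal C sketch of that arithmetic, with made-up names (tree_cost, pairwise_tree_cost) used only for illustration:

```c
#include <assert.h>

/* Cost of combining `inputs` partial vectors with two-input merges.  */
typedef struct
{
  int merges; /* total number of merge instructions */
  int depth;  /* number of dependent merge levels (path length) */
} tree_cost;

static tree_cost pairwise_tree_cost (int inputs)
{
  tree_cost c = { 0, 0 };
  assert (inputs >= 1);
  while (inputs > 1)
    {
      c.merges += inputs / 2;      /* one merge per pair on this level */
      inputs = (inputs + 1) / 2;   /* results carried to the next level */
      c.depth++;
    }
  return c;
}
```

With 16 single-byte moves (mtvsrd) this gives 15 merges over 4 levels; with 8 two-byte moves (mtvsrdd) it gives 7 merges over 3 levels.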

* [Bug target/96933] rs6000: inefficient code for char/short vec CTOR
From: linkw at gcc dot gnu.org @ 2020-09-07  2:39 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96933

--- Comment #5 from Kewen Lin <linkw at gcc dot gnu.org> ---
(In reply to Segher Boessenkool from comment #4)
> Yes, timing suggests there is some SHL/LHS flush.
> 
> On p9 and later we can use mtvsrdd instead of mtvsrd (moving two
> bytes into place at once), which reduces the number of moves from
> 16 to 8, and the number of merges from 15 to 7 (and reduces path
> length by 1).  This sounds like a no-brainer win with that :-)

Good idea, it looks better on P9. One thing to double-check: currently there
are no instructions like vmrgob and vmrgoh, so would the merges you mentioned,
from vector bytes to vector short and from vector short to vector word, need
an artificial control vector?

* [Bug target/96933] rs6000: inefficient code for char/short vec CTOR
From: linkw at gcc dot gnu.org @ 2020-09-07  7:26 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96933

--- Comment #6 from Kewen Lin <linkw at gcc dot gnu.org> ---
(In reply to Kewen Lin from comment #5)
> (In reply to Segher Boessenkool from comment #4)
> > Yes, timing suggests there is some SHL/LHS flush.
> > 
> > On p9 and later we can use mtvsrdd instead of mtvsrd (moving two
> > bytes into place at once), which reduces the number of moves from
> > 16 to 8, and the number of merges from 15 to 7 (and reduces path
> > length by 1).  This sounds like a no-brainer win with that :-)
> 
> Good idea, it looks better on P9. One thing to double confirm, currently
> there are no instructions like vmrgob and vmrgoh, so of the mergings you
> mentioned from vector bytes to vector short and vector short to vector word
> needs artificial control vector?

I improved the patch to support mtvsrdd; the asm for char now looks like:

0000000000000000 <test_char>:
   0:   00 00 4c 3c     addis   r2,r12,0
                        0: R_PPC64_REL16_HA     .TOC.
   4:   00 00 42 38     addi    r2,r2,0
                        4: R_PPC64_REL16_LO     .TOC.+0x4
   8:   e8 ff a1 fb     std     r29,-24(r1)
   c:   00 00 a2 3f     addis   r29,r2,0
                        c: R_PPC64_TOC16_HA     .rodata.cst16
  10:   f0 ff c1 fb     std     r30,-16(r1)
  14:   f8 ff e1 fb     std     r31,-8(r1)
  18:   67 1b 24 7c     mtvsrdd vs33,r4,r3
  1c:   67 3b 28 7d     mtvsrdd vs41,r8,r7
  20:   68 00 c1 8b     lbz     r30,104(r1)
  24:   78 00 e1 8b     lbz     r31,120(r1)
  28:   00 00 bd 3b     addi    r29,r29,0
                        28: R_PPC64_TOC16_LO    .rodata.cst16
  2c:   60 00 81 89     lbz     r12,96(r1)
  30:   70 00 61 89     lbz     r11,112(r1)
  34:   80 00 81 88     lbz     r4,128(r1)
  38:   88 00 61 88     lbz     r3,136(r1)
  3c:   90 00 01 89     lbz     r8,144(r1)
  40:   98 00 e1 88     lbz     r7,152(r1)
  44:   67 2b 46 7c     mtvsrdd vs34,r6,r5
  48:   67 4b aa 7d     mtvsrdd vs45,r10,r9
  4c:   09 00 9d f5     lxv     vs44,0(r29)
  50:   67 63 5e 7d     mtvsrdd vs42,r30,r12
  54:   67 5b 1f 7c     mtvsrdd vs32,r31,r11
  58:   e8 ff a1 eb     ld      r29,-24(r1)
  5c:   f0 ff c1 eb     ld      r30,-16(r1)
  60:   67 23 63 7d     mtvsrdd vs43,r3,r4
  64:   f8 ff e1 eb     ld      r31,-8(r1)
  68:   3b 0b 42 10     vpermr  v2,v2,v1,v12
  6c:   67 43 27 7c     mtvsrdd vs33,r7,r8
  70:   3b 4b ad 11     vpermr  v13,v13,v9,v12
  74:   3b 53 00 10     vpermr  v0,v0,v10,v12
  78:   3b 5b 21 10     vpermr  v1,v1,v11,v12
  7c:   97 11 4d f0     xxmrglw vs34,vs45,vs34
  80:   97 01 01 f0     xxmrglw vs32,vs33,vs32
  84:   57 13 40 f0     xxmrgld vs34,vs32,vs34
  88:   20 00 80 4e     blr

Timing on Power9 for the three variants:
  1) mtvsrdd under TARGET_DIRECT_MOVE_128:   7.28s
  2) mtvsrd under TARGET_DIRECT_MOVE:        7.41s
  3) original:                              18.19s
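
As a quick check of what these timings mean, the relative speedup over the original can be computed as old / new - 1. The helper below is only a sketch and its name is illustrative.

```c
#include <assert.h>

/* Relative speedup of `new_seconds` over `old_seconds`, in percent.
   For example, halving the run time is a 100% speedup.  */
static double speedup_percent (double old_seconds, double new_seconds)
{
  assert (new_seconds > 0.0);
  return (old_seconds / new_seconds - 1.0) * 100.0;
}
```

With the numbers above, the mtvsrdd variant comes out roughly 150% faster than the original and the mtvsrd variant roughly 145% faster.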

* [Bug target/96933] rs6000: inefficient code for char/short vec CTOR
From: segher at gcc dot gnu.org @ 2020-09-07 15:14 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96933

--- Comment #7 from Segher Boessenkool <segher at gcc dot gnu.org> ---
There are vmrglb and vmrghb etc.?

* [Bug target/96933] rs6000: inefficient code for char/short vec CTOR
From: linkw at gcc dot gnu.org @ 2020-09-08  5:26 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96933

--- Comment #8 from Kewen Lin <linkw at gcc dot gnu.org> ---
(In reply to Segher Boessenkool from comment #7)
> There are vmrglb and vmrghb etc.?

But each of these handles only the low or the high part. With mtvsrdd, both
the low and the high doubleword hold a value, and there is no Vector Merge
Even/Odd for char or short to merge them. For now I use an artificial control
vector for the merging; correct me if I'm missing something.
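
To illustrate the limitation, here is a small portable C model of vmrglb working on 16-byte arrays in big-endian element order (index 0 is the leftmost byte). It is a conceptual sketch, not GCC code: the type vec128 and the functions merge_low_bytes, move_pair and contains are illustrative names, and move_pair assumes each scalar lands in the lowest byte of its doubleword.

```c
#include <assert.h>
#include <string.h>

typedef struct { unsigned char b[16]; } vec128;

/* Model of vmrglb: interleave bytes 8..15 of a and b.  */
static vec128 merge_low_bytes (vec128 a, vec128 b)
{
  vec128 r;
  for (int i = 0; i < 8; i++)
    {
      r.b[2 * i] = a.b[8 + i];
      r.b[2 * i + 1] = b.b[8 + i];
    }
  return r;
}

/* Model of a two-register direct move: `hi` ends up in the last byte
   of the high doubleword (index 7), `lo` in the last byte of the low
   doubleword (index 15).  All other bytes are zeroed here.  */
static vec128 move_pair (unsigned char hi, unsigned char lo)
{
  vec128 r;
  memset (r.b, 0, sizeof r.b);
  r.b[7] = hi;
  r.b[15] = lo;
  return r;
}

/* Return nonzero if `value` occurs anywhere in v.  */
static int contains (vec128 v, unsigned char value)
{
  return memchr (v.b, value, sizeof v.b) != NULL;
}

/* The merge keeps the low-doubleword bytes and drops the others.  */
static void check_merge_low (void)
{
  vec128 m = merge_low_bytes (move_pair ('A', 'B'), move_pair ('C', 'D'));
  assert (contains (m, 'B') && contains (m, 'D'));
  assert (!contains (m, 'A') && !contains (m, 'C'));
}
```

In this model a single low merge picks up only the values moved into the low doubleword, so the values in the high doubleword need some other instruction to reach the result.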

* [Bug target/96933] rs6000: inefficient code for char/short vec CTOR
From: segher at gcc dot gnu.org @ 2020-09-08 18:30 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96933

--- Comment #9 from Segher Boessenkool <segher at gcc dot gnu.org> ---
I'm not sure what you mean.

vmrglb merges the vectors
  abcdefghijklmnop
and
  ABCDEFGHIJKLMNOP
to
  iIjJkKlLmMnNoOpP

... ah, I see what you mean I guess.

So, use something else instead?  How about vpku*um?

First vpkudum, xforming
  xxxxxxxAxxxxxxxB
and
  xxxxxxxCxxxxxxxD
into
  xxxAxxxBxxxCxxxD

and then vpkuwum:
  xxxAxxxBxxxCxxxD
and
  xxxExxxFxxxGxxxH
into
  xAxBxCxDxExFxGxH

and finally vpkuhum:
  xAxBxCxDxExFxGxH
and
  xIxJxKxLxMxNxOxP
into
  ABCDEFGHIJKLMNOP

?
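
The three pack steps sketched above can be modelled in portable C as follows. This is a conceptual model on byte arrays in big-endian element order, matching the diagrams, and not the code GCC emits. The names vec128, pack_modulo, move_pair and build16 are illustrative, and the "x" bytes of the diagrams are simply zero here.

```c
#include <assert.h>
#include <string.h>

typedef struct { unsigned char b[16]; } vec128;

/* Model of the vpku*um family: for each element of `width` bytes,
   keep its low-order half; elements of a come first, then those of b.
   width 8 models vpkudum, 4 models vpkuwum, 2 models vpkuhum.  */
static vec128 pack_modulo (vec128 a, vec128 b, int width)
{
  const vec128 *src[2] = { &a, &b };
  vec128 r;
  int half = width / 2;
  int n = 0;
  for (int s = 0; s < 2; s++)
    for (int e = 0; e < 16 / width; e++)
      for (int k = 0; k < half; k++)
        r.b[n++] = src[s]->b[e * width + half + k];
  return r;
}

/* Model of a two-register move: xxxxxxxA xxxxxxxB in the diagrams.  */
static vec128 move_pair (unsigned char hi, unsigned char lo)
{
  vec128 r;
  memset (r.b, 0, sizeof r.b);
  r.b[7] = hi;
  r.b[15] = lo;
  return r;
}

/* Build a 16-byte vector from 16 scalars: 8 moves and 7 packs.  */
static vec128 build16 (const unsigned char f[16])
{
  vec128 d[8], w[4], h[2];
  for (int i = 0; i < 8; i++)
    d[i] = move_pair (f[2 * i], f[2 * i + 1]);
  for (int i = 0; i < 4; i++)
    w[i] = pack_modulo (d[2 * i], d[2 * i + 1], 8);  /* xxxAxxxBxxxCxxxD */
  for (int i = 0; i < 2; i++)
    h[i] = pack_modulo (w[2 * i], w[2 * i + 1], 4);  /* xAxBxCxDxExFxGxH */
  return pack_modulo (h[0], h[1], 2);                /* ABCDEFGHIJKLMNOP */
}

/* The bytes come out in argument order.  */
static void check_build16 (void)
{
  const unsigned char in[16] = "ABCDEFGHIJKLMNOP";
  vec128 r = build16 (in);
  assert (memcmp (r.b, in, 16) == 0);
}
```

On real hardware the register and element order depend on endianness, so the operand order of the actual instructions may differ from this model.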

* [Bug target/96933] rs6000: inefficient code for char/short vec CTOR
From: linkw at gcc dot gnu.org @ 2020-09-09  5:20 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96933

--- Comment #10 from Kewen Lin <linkw at gcc dot gnu.org> ---
(In reply to Segher Boessenkool from comment #9)
> I'm not sure what you mean.
> 
> vmrglb merges the vectors
>   abcdefghijklmnop
> and
>   ABCDEFGHIJKLMNOP
> to
>   iIjJkKlLmMnNoOpP
> 
> ... ah, I see what you mean I guess.
> 
> So, use something else instead?  How about vpku*um?
> 
> First vpkudum, xforming
>   xxxxxxxAxxxxxxxB
> and
>   xxxxxxxCxxxxxxxD
> into
>   xxxAxxxBxxxCxxxD
> 
> and then vpkuwum:
>   xxxAxxxBxxxCxxxD
> and
>   xxxExxxFxxxGxxxH
> into
>   xAxBxCxDxExFxGxH
> 
> and finally vpkuhum:
>   xAxBxCxDxExFxGxH
> and
>   xIxJxKxLxMxNxOxP
> into
>   ABCDEFGHIJKLMNOP
> 
> ?

Great, it works! Thanks for the advice. In my testing, the char case is on par
with the artificial control vector version (7.30s vs. 7.28s), while the short
case is better (28.66s vs. 31.52s). I will update the posted patch to V2.

* [Bug target/96933] rs6000: inefficient code for char/short vec CTOR
From: cvs-commit at gcc dot gnu.org @ 2020-11-05  8:09 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96933

--- Comment #11 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Kewen Lin <linkw@gcc.gnu.org>:

https://gcc.gnu.org/g:025f434a87336e38bf5140fba2005081876aa911

commit r11-4731-g025f434a87336e38bf5140fba2005081876aa911
Author: Kewen Lin <linkw@linux.ibm.com>
Date:   Thu Nov 5 00:04:10 2020 -0600

    rs6000: Use direct move for char/short vector CTOR [PR96933]

    This patch is to make vector CTOR with char/short leverage direct
    move instructions when they are available.  With one constructed
    test case, it can speed up 145% for char and 190% for short on P9.

    Tested SPEC2017 x264_r at -Ofast on P9, it gets 1.61% speedup
    (but based on unexpected SLP see PR96789).

    Bootstrapped/regtested on powerpc64{,le}-linux-gnu P8 and
    powerpc64le-linux-gnu P9.

    gcc/ChangeLog:

            PR target/96933
            * config/rs6000/rs6000.c (rs6000_expand_vector_init): Use direct
            move instructions for vector construction with char/short types.
            * config/rs6000/rs6000.md (p8_mtvsrwz_v16qisi2): New define_insn.
            (p8_mtvsrd_v16qidi2): Likewise.

    gcc/testsuite/ChangeLog:

            PR target/96933
            * gcc.target/powerpc/pr96933-1.c: New test.
            * gcc.target/powerpc/pr96933-2.c: New test.
            * gcc.target/powerpc/pr96933-3.c: New test.
            * gcc.target/powerpc/pr96933-4.c: New test.
            * gcc.target/powerpc/pr96933.h: New test.
            * gcc.target/powerpc/pr96933-run.h: New test.

* [Bug target/96933] rs6000: inefficient code for char/short vec CTOR
From: linkw at gcc dot gnu.org @ 2020-11-05  8:42 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96933

Kewen Lin <linkw at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|---                         |FIXED

--- Comment #12 from Kewen Lin <linkw at gcc dot gnu.org> ---
Should be done with latest trunk.

* [Bug target/96933] rs6000: inefficient code for char/short vec CTOR
From: cvs-commit at gcc dot gnu.org @ 2020-11-06 22:14 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96933

--- Comment #13 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Segher Boessenkool <segher@gcc.gnu.org>:

https://gcc.gnu.org/g:e5502ae72f784470019de5850017ad0c87ffacef

commit r11-4805-ge5502ae72f784470019de5850017ad0c87ffacef
Author: Segher Boessenkool <segher@kernel.crashing.org>
Date:   Fri Nov 6 12:50:35 2020 +0000

    rs6000: Fix TARGET_POWERPC64 vs. TARGET_64BIT confusion

    I gave Ke Wen bad advice, luckily David corrected me: it is true that we
    cannot use TARGET_POWERPC64 on many 32-bit OSes, since either the kernel
    or userland does not save the top half of the 64-bit integer registers,
    but we do not have to care about that in separate patterns or related
    code.  The flag is automatically not enabled by default on targets that
    do not handle this correctly.

    This patch fixes it.

    Segher

    2020-11-06  Segher Boessenkool  <segher@kernel.crashing.org>

            PR target/96933
            * config/rs6000/rs6000.c (rs6000_expand_vector_init): Use
            TARGET_POWERPC64 instead of TARGET_64BIT.
