public inbox for gcc-bugs@sourceware.org
* [Bug target/106038] New: x86_64 vectorization of ALU ops using xmm registers prematurely
@ 2022-06-20 23:49 goldstein.w.n at gmail dot com
  2022-06-20 23:53 ` [Bug target/106038] " pinskia at gcc dot gnu.org
                   ` (6 more replies)
  0 siblings, 7 replies; 8+ messages in thread
From: goldstein.w.n at gmail dot com @ 2022-06-20 23:49 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106038

            Bug ID: 106038
           Summary: x86_64 vectorization of ALU ops using xmm registers
                    prematurely
           Product: gcc
           Version: 13.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: goldstein.w.n at gmail dot com
  Target Milestone: ---

See: https://godbolt.org/z/YxWEn6Y65

Basically, in all cases where the total amount of memory touched is <= 8 bytes
(word size), the vectorization pass chooses to use xmm registers to vectorize
the unrolled loops, which is inefficient.

Using GPRs (as GCC <= 9.5 did) is faster and gives smaller code.


Related to: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106022


* [Bug target/106038] x86_64 vectorization of ALU ops using xmm registers prematurely
  2022-06-20 23:49 [Bug target/106038] New: x86_64 vectorization of ALU ops using xmm registers prematurely goldstein.w.n at gmail dot com
@ 2022-06-20 23:53 ` pinskia at gcc dot gnu.org
  2022-06-21  0:01 ` goldstein.w.n at gmail dot com
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: pinskia at gcc dot gnu.org @ 2022-06-20 23:53 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106038

--- Comment #1 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Created attachment 53175
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=53175&action=edit
Testcase -O3 -march=icelake-client

Next time attach the testcase and not link to godbolt without a testcase.


* [Bug target/106038] x86_64 vectorization of ALU ops using xmm registers prematurely
  2022-06-20 23:49 [Bug target/106038] New: x86_64 vectorization of ALU ops using xmm registers prematurely goldstein.w.n at gmail dot com
  2022-06-20 23:53 ` [Bug target/106038] " pinskia at gcc dot gnu.org
@ 2022-06-21  0:01 ` goldstein.w.n at gmail dot com
  2022-06-21  1:28 ` crazylht at gmail dot com
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: goldstein.w.n at gmail dot com @ 2022-06-21  0:01 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106038

--- Comment #2 from Noah Goldstein <goldstein.w.n at gmail dot com> ---
(In reply to Andrew Pinski from comment #1)
> Created attachment 53175 [details]
> Testcase -O3 -march=icelake-client
> 
> Next time attach the testcase and not link to godbolt without a testcase.

Sorry.

I tried playing around in i386.cc to see if I could modify the `stmt_cost` for
the `BIT_{AND|IOR|XOR}_EXPR` cases but that didn't seem to have any effect. Do
you know where I might go to fix this?


* [Bug target/106038] x86_64 vectorization of ALU ops using xmm registers prematurely
  2022-06-20 23:49 [Bug target/106038] New: x86_64 vectorization of ALU ops using xmm registers prematurely goldstein.w.n at gmail dot com
  2022-06-20 23:53 ` [Bug target/106038] " pinskia at gcc dot gnu.org
  2022-06-21  0:01 ` goldstein.w.n at gmail dot com
@ 2022-06-21  1:28 ` crazylht at gmail dot com
  2022-06-21  8:20 ` rguenth at gcc dot gnu.org
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: crazylht at gmail dot com @ 2022-06-21  1:28 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106038

Hongtao.liu <crazylht at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |crazylht at gmail dot com

--- Comment #3 from Hongtao.liu <crazylht at gmail dot com> ---
The vectorizer sees 2 scalar loads + 2 bit_ops + 2 scalar stores vs. 1
unaligned_load + 1 bit_op + 1 unaligned_store, so only scaling the cost of
bit_op doesn't help.
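To make the arithmetic behind this concrete, here is a toy version of the comparison. It is only a sketch: the struct, the function names and every unit cost are hypothetical placeholders and are not taken from GCC's x86 cost tables. Only the operation counts come from the comment above.

```c
/* Toy cost comparison. All unit costs are hypothetical placeholders,
   not values from GCC's x86 cost tables. */
struct costs {
    int scalar_load, scalar_op, scalar_store;
    int vec_load, vec_op, vec_store;
};

/* 2 scalar loads + 2 bit_ops + 2 scalar stores. */
int scalar_total(const struct costs *c)
{
    return 2 * c->scalar_load + 2 * c->scalar_op + 2 * c->scalar_store;
}

/* 1 unaligned_load + 1 bit_op + 1 unaligned_store. */
int vector_total(const struct costs *c)
{
    return c->vec_load + c->vec_op + c->vec_store;
}

/* Nonzero when the vector form looks cheaper, so it would be chosen. */
int prefers_vector(const struct costs *c)
{
    return vector_total(c) < scalar_total(c);
}
```

With every unit cost set to 1 the totals are 6 (scalar) against 3 (vector). Doubling only `vec_op` gives 6 against 4, so the vector form still wins, because the loads and stores dominate the sum.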

At the RTL level, we have:

(note 3 14 4 2 NOTE_INSN_DELETED)
(note 4 3 7 2 NOTE_INSN_FUNCTION_BEG)
(insn 7 4 8 2 (set (reg:V2QI 87 [ vect__20.19 ])
        (mem:V2QI (reg:DI 91) [0 MEM <const vector(2) unsigned char> [(const uint8_t *)b_11(D)]+0 S2 A8])) "test.c":31:1 1414 {*movv2qi_internal}
     (expr_list:REG_DEAD (reg:DI 91)
        (nil)))
(insn 8 7 9 2 (set (reg:V2QI 88 [ vect__18.16 ])
        (mem:V2QI (reg/v/f:DI 85 [ a ]) [0 MEM <vector(2) unsigned char> [(uint8_t *)a_10(D)]+0 S2 A8])) "test.c":31:1 1414 {*movv2qi_internal}
     (expr_list:REG_EQUIV (mem:V2QI (reg/v/f:DI 85 [ a ]) [0 MEM <vector(2) unsigned char> [(uint8_t *)a_10(D)]+0 S2 A8])
        (nil)))
(insn 9 8 10 2 (parallel [
            (set (reg:V2QI 89 [ vect__21.20 ])
                (xor:V2QI (reg:V2QI 87 [ vect__20.19 ])
                    (reg:V2QI 88 [ vect__18.16 ])))
            (clobber (reg:CC 17 flags))
        ]) "test.c":31:1 1627 {xorv2qi3}
     (expr_list:REG_DEAD (reg:V2QI 88 [ vect__18.16 ])
        (expr_list:REG_DEAD (reg:V2QI 87 [ vect__20.19 ])
            (expr_list:REG_UNUSED (reg:CC 17 flags)
                (expr_list:REG_EQUIV (mem:V2QI (reg/v/f:DI 85 [ a ]) [0 MEM <vector(2) unsigned char> [(uint8_t *)a_10(D)]+0 S2 A8])
                    (nil))))))
(insn 10 9 0 2 (set (mem:V2QI (reg/v/f:DI 85 [ a ]) [0 MEM <vector(2) unsigned char> [(uint8_t *)a_10(D)]+0 S2 A8])
        (reg:V2QI 89 [ vect__21.20 ])) "test.c":31:1 1414 {*movv2qi_internal}
     (expr_list:REG_DEAD (reg:V2QI 89 [ vect__21.20 ])

If the RA can allocate 87/88/89 to GPRs, the result would be the same as the
non-vectorized version.
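As a rough source-level picture of what that would mean, the sketch below shows a two-byte xor and the same operation done in one 16-bit integer. It is a guess at the shape of the code, not the attached testcase: the names `a`, `b` and the `uint8_t` types come from the dump, while the function names `xor2_bytes` and `xor2_gpr` are made up for illustration.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical source shape that could lead to the V2QI xor in the dump. */
void xor2_bytes(uint8_t *a, const uint8_t *b)
{
    a[0] ^= b[0];
    a[1] ^= b[1];
}

/* The same operation carried out in a single 16-bit integer, i.e. what
   keeping the value in a general-purpose register amounts to.
   memcpy is used for the unaligned, aliasing-safe loads and store. */
void xor2_gpr(uint8_t *a, const uint8_t *b)
{
    uint16_t x, y;
    memcpy(&x, a, sizeof x);
    memcpy(&y, b, sizeof y);
    x ^= y;
    memcpy(a, &x, sizeof x);
}
```

Both functions leave the same two bytes in `a`; the second form needs no xmm register at all.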


* [Bug target/106038] x86_64 vectorization of ALU ops using xmm registers prematurely
  2022-06-20 23:49 [Bug target/106038] New: x86_64 vectorization of ALU ops using xmm registers prematurely goldstein.w.n at gmail dot com
                   ` (2 preceding siblings ...)
  2022-06-21  1:28 ` crazylht at gmail dot com
@ 2022-06-21  8:20 ` rguenth at gcc dot gnu.org
  2022-06-21 15:33 ` goldstein.w.n at gmail dot com
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-06-21  8:20 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106038

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization
   Last reconfirmed|                            |2022-06-21
             Status|UNCONFIRMED                 |NEW
     Ever confirmed|0                           |1
                 CC|                            |rguenth at gcc dot gnu.org
             Blocks|                            |53947
             Target|                            |x86_64-*-*

--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
The vectorizer does not anticipate store-merging performing "vectorization" in
GPRs and thus the scalar cost is off (it also doesn't anticipate the different
ISA constraints wrt xmm vs gpr usage).

I wonder if we should try to follow what store-merging would do with respect
to "vector types", thus prefer "general vectors" (but explicitly via integer
types since we can't have vector types with both integer and vector modes)
when possible (for bit operations and plain copies).
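One way to see why this applies to bit operations and plain copies in particular: a bitwise operation on an integer never moves information between byte lanes, while an addition can carry from one lane into the next. The sketch below only illustrates that point. The helper names are made up, and it says nothing about how the vectorizer would implement the idea.

```c
#include <stdint.h>
#include <string.h>

/* Treat a 32-bit integer as a "general vector" of four bytes. */
static uint32_t pack4(const uint8_t *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

static void unpack4(uint8_t *p, uint32_t v)
{
    memcpy(p, &v, sizeof v);
}

/* Lane-wise XOR done as one integer XOR: always matches the per-byte result. */
void xor4_as_int(uint8_t *dst, const uint8_t *a, const uint8_t *b)
{
    unpack4(dst, pack4(a) ^ pack4(b));
}

/* Lane-wise ADD done as one integer ADD: differs from the per-byte result
   when an inner lane overflows, because the carry leaks into a neighbour. */
void add4_as_int(uint8_t *dst, const uint8_t *a, const uint8_t *b)
{
    unpack4(dst, pack4(a) + pack4(b));
}
```

For example, with `a = {0x00, 0xFF, 0x00, 0x00}` and `b = {0x00, 0x01, 0x00, 0x00}`, a per-byte add wraps the second lane to zero, but the integer add pushes a carry into an adjacent lane.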

Scanning over an SLP instance (group) and substituting integer types for
SLP_TREE_VECTYPE might be possible.  Doing this nicely somewhere is going to
be more interesting.  Far away vectorizable_* should compute a set of
{ vector-type, cost } pairs from the set of input operand vector-type[, cost]
pair sets.  Not having "generic" vectors as vector types and relying on
vector lowering to expand them would be an incremental support step for this
I guess.

"backwards STV" could of course also work on the target side.


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations


* [Bug target/106038] x86_64 vectorization of ALU ops using xmm registers prematurely
  2022-06-20 23:49 [Bug target/106038] New: x86_64 vectorization of ALU ops using xmm registers prematurely goldstein.w.n at gmail dot com
                   ` (3 preceding siblings ...)
  2022-06-21  8:20 ` rguenth at gcc dot gnu.org
@ 2022-06-21 15:33 ` goldstein.w.n at gmail dot com
  2022-06-21 15:56 ` crazylht at gmail dot com
  2022-07-22  1:39 ` cvs-commit at gcc dot gnu.org
  6 siblings, 0 replies; 8+ messages in thread
From: goldstein.w.n at gmail dot com @ 2022-06-21 15:33 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106038

--- Comment #5 from Noah Goldstein <goldstein.w.n at gmail dot com> ---
(In reply to Richard Biener from comment #4)
> The vectorizer does not anticipate store-merging performing "vectorization"
> in GPRs and thus the scalar cost is off (it also doesn't anticipate the
> different ISA constraints wrt xmm vs gpr usage).
> 
> I wonder if we should try to follow what store-merging would do with respect
> to "vector types", thus prefer "general vectors" (but explicitly via integer
> types since we can't have vector types with both integer and vector modes)
> when possible (for bit operations and plain copies).
> 
> scanning over an SLP instance (group) and substituting integer types for
> SLP_TREE_VECTYPE might be possible.  Doing this nicely somewhere is going to
> be more interesting.  Far away vectorizable_* should compute a set of
> { vector-type, cost } pairs from the set of input operand vector-type[, cost]
> pair sets.  Not having "generic" vectors as vector types and relying on
> vector lowering to expand them would be an incremental support step for this
> I guess.
> 
> "backwards STV" could of course also work on the target side.

backwards STV?


* [Bug target/106038] x86_64 vectorization of ALU ops using xmm registers prematurely
  2022-06-20 23:49 [Bug target/106038] New: x86_64 vectorization of ALU ops using xmm registers prematurely goldstein.w.n at gmail dot com
                   ` (4 preceding siblings ...)
  2022-06-21 15:33 ` goldstein.w.n at gmail dot com
@ 2022-06-21 15:56 ` crazylht at gmail dot com
  2022-07-22  1:39 ` cvs-commit at gcc dot gnu.org
  6 siblings, 0 replies; 8+ messages in thread
From: crazylht at gmail dot com @ 2022-06-21 15:56 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106038

--- Comment #6 from Hongtao.liu <crazylht at gmail dot com> ---

> backwards STV?

There's a pass in the x86 backend called STV (scalar to vector); it converts
scalar instructions into vector mode when profitable. I guess "backwards STV"
means converting vector instructions back to scalar mode.


* [Bug target/106038] x86_64 vectorization of ALU ops using xmm registers prematurely
  2022-06-20 23:49 [Bug target/106038] New: x86_64 vectorization of ALU ops using xmm registers prematurely goldstein.w.n at gmail dot com
                   ` (5 preceding siblings ...)
  2022-06-21 15:56 ` crazylht at gmail dot com
@ 2022-07-22  1:39 ` cvs-commit at gcc dot gnu.org
  6 siblings, 0 replies; 8+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2022-07-22  1:39 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106038

--- Comment #7 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by hongtao Liu <liuhongt@gcc.gnu.org>:

https://gcc.gnu.org/g:605b64251c78f29da32ed807413971339f27d13b

commit r13-1790-g605b64251c78f29da32ed807413971339f27d13b
Author: liuhongt <hongtao.liu@intel.com>
Date:   Thu Jul 7 14:33:32 2022 +0800

    Extend 16/32-bit vector bit_op patterns with (m,0,i) alternative.

    And split it after reload.

    gcc/ChangeLog:

            PR target/106038
            * config/i386/mmx.md (<code><mode>3): New define_expand, it's
            original "<code><mode>3".
            (*<code><mode>3): New define_insn, it's original
            "<code><mode>3" be extended to handle memory and immediate
            operand with ix86_binary_operator_ok. Also adjust define_split
            after it.
            (mmxinsnmode): New mode attribute.
            (*mov<mode>_imm): Refactor with mmxinsnmode.
            * config/i386/predicates.md
            (register_or_x86_64_const_vector_operand): New predicate.

    gcc/testsuite/ChangeLog:

            * gcc.target/i386/pr106038-1.c: New test.
