[Bug target/82731] _mm256_set_epi8(array[offset[0]], array[offset[1]], ...) byte gather makes slow code, trying to zero-extend all the uint16

public inbox for gcc-bugs@sourceware.org

* [Bug target/82731] _mm256_set_epi8(array[offset[0]], array[offset[1]], ...) byte gather makes slow code, trying to zero-extend all the uint16_t offsets first and spilling them.
       [not found] <bug-82731-4@http.gcc.gnu.org/bugzilla/>
@ 2024-04-15  7:01 ` pinskia at gcc dot gnu.org
  2024-04-15  7:04 ` pinskia at gcc dot gnu.org
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 7+ messages in thread
From: pinskia at gcc dot gnu.org @ 2024-04-15  7:01 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82731

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|                            |2024-04-15
           Severity|normal                      |enhancement
             Status|UNCONFIRMED                 |NEW
     Ever confirmed|0                           |1

--- Comment #1 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Confirmed. This comes down to having a scheduler that reduces live ranges much
more aggressively.

Adding -fschedule-insns helps slightly but not enough in this case.


* [Bug target/82731] _mm256_set_epi8(array[offset[0]], array[offset[1]], ...) byte gather makes slow code, trying to zero-extend all the uint16_t offsets first and spilling them.
       [not found] <bug-82731-4@http.gcc.gnu.org/bugzilla/>
  2024-04-15  7:01 ` [Bug target/82731] _mm256_set_epi8(array[offset[0]], array[offset[1]], ...) byte gather makes slow code, trying to zero-extend all the uint16_t offsets first and spilling them pinskia at gcc dot gnu.org
@ 2024-04-15  7:04 ` pinskia at gcc dot gnu.org
  2024-04-17  8:50 ` liuhongt at gcc dot gnu.org
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 7+ messages in thread
From: pinskia at gcc dot gnu.org @ 2024-04-15  7:04 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82731

--- Comment #2 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Note you can reproduce the same issue with SSE2 (and not just AVX):
```

#define vect16 __attribute__((vector_size(16)))

vect16 char gather(char *array, unsigned short *offset) {
  return (vect16 char){
      array[offset[0]],  array[offset[1]],  array[offset[2]],  array[offset[3]],
      array[offset[4]],  array[offset[5]],  array[offset[6]],  array[offset[7]],
      array[offset[8]],  array[offset[9]],  array[offset[10]], array[offset[11]],
      array[offset[12]], array[offset[13]], array[offset[14]]};
}
```
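
For anyone who wants to check that a candidate fix still computes the right
bytes, here is a minimal sketch of a self-checking variant. It is not part of
the bug report: `gather16` and `gather16_matches_scalar` are hypothetical
names, and it assumes all 16 lanes are spelled out (the testcase above lists
15 initializers). It only uses the GCC vector extension, so no -mavx is needed.

```c
#include <assert.h>

#define vect16 __attribute__((vector_size(16)))

/* Hypothetical variant of the reproducer: same access pattern, but with
   all 16 lanes given explicitly (an assumption, see the note above).  */
vect16 char gather16(const char *array, const unsigned short *offset) {
  return (vect16 char){
      array[offset[0]],  array[offset[1]],  array[offset[2]],  array[offset[3]],
      array[offset[4]],  array[offset[5]],  array[offset[6]],  array[offset[7]],
      array[offset[8]],  array[offset[9]],  array[offset[10]], array[offset[11]],
      array[offset[12]], array[offset[13]], array[offset[14]], array[offset[15]]};
}

/* Returns 1 if every lane equals the plain scalar lookup, 0 otherwise.  */
int gather16_matches_scalar(void) {
  char table[256];
  unsigned short offset[16];

  for (int i = 0; i < 256; i++)
    table[i] = (char)(255 - i);                    /* distinct byte per index */
  for (int i = 0; i < 16; i++)
    offset[i] = (unsigned short)((i * 37) & 255);  /* scattered offsets */

  vect16 char v = gather16(table, offset);
  for (int i = 0; i < 16; i++)
    if (v[i] != table[offset[i]])
      return 0;
  return 1;
}
```

Compiling `gather16` alone at -O2 or -O3 and reading the assembly is enough to
look for the spills; the checker only guards against miscompiles while
experimenting.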


* [Bug target/82731] _mm256_set_epi8(array[offset[0]], array[offset[1]], ...) byte gather makes slow code, trying to zero-extend all the uint16_t offsets first and spilling them.
       [not found] <bug-82731-4@http.gcc.gnu.org/bugzilla/>
  2024-04-15  7:01 ` [Bug target/82731] _mm256_set_epi8(array[offset[0]], array[offset[1]], ...) byte gather makes slow code, trying to zero-extend all the uint16_t offsets first and spilling them pinskia at gcc dot gnu.org
  2024-04-15  7:04 ` pinskia at gcc dot gnu.org
@ 2024-04-17  8:50 ` liuhongt at gcc dot gnu.org
  2024-04-17  9:30 ` liuhongt at gcc dot gnu.org
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 7+ messages in thread
From: liuhongt at gcc dot gnu.org @ 2024-04-17  8:50 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82731

Hongtao Liu <liuhongt at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |liuhongt at gcc dot gnu.org

--- Comment #3 from Hongtao Liu <liuhongt at gcc dot gnu.org> ---
Looks like ix86_vect_estimate_reg_pressure doesn't work here, taking a look.


* [Bug target/82731] _mm256_set_epi8(array[offset[0]], array[offset[1]], ...) byte gather makes slow code, trying to zero-extend all the uint16_t offsets first and spilling them.
       [not found] <bug-82731-4@http.gcc.gnu.org/bugzilla/>
                   ` (2 preceding siblings ...)
  2024-04-17  8:50 ` liuhongt at gcc dot gnu.org
@ 2024-04-17  9:30 ` liuhongt at gcc dot gnu.org
  2024-04-17  9:45 ` rguenth at gcc dot gnu.org
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 7+ messages in thread
From: liuhongt at gcc dot gnu.org @ 2024-04-17  9:30 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82731

--- Comment #4 from Hongtao Liu <liuhongt at gcc dot gnu.org> ---
(In reply to Hongtao Liu from comment #3)
> Looks like ix86_vect_estimate_reg_pressure doesn't work here, taking a look.

Oh, ix86_vect_estimate_reg_pressure is only used for loops; the BB vectorizer
only uses ix86_builtin_vectorization_cost, not add_stmt_cost/finish_cost.


* [Bug target/82731] _mm256_set_epi8(array[offset[0]], array[offset[1]], ...) byte gather makes slow code, trying to zero-extend all the uint16_t offsets first and spilling them.
       [not found] <bug-82731-4@http.gcc.gnu.org/bugzilla/>
                   ` (3 preceding siblings ...)
  2024-04-17  9:30 ` liuhongt at gcc dot gnu.org
@ 2024-04-17  9:45 ` rguenth at gcc dot gnu.org
  2024-04-17 10:36 ` rguenth at gcc dot gnu.org
  2024-04-17 10:59 ` liuhongt at gcc dot gnu.org
  6 siblings, 0 replies; 7+ messages in thread
From: rguenth at gcc dot gnu.org @ 2024-04-17  9:45 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82731

--- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
We do not BB vectorize gathers I think (ISTR some "loop" uses in the
infrastructure, not too difficult to fix I guess).

In the end the problem is RTL expansion of the CTOR and then lack of
combine?

Look at how we RTL expand:

typedef char __v32qi __attribute__((vector_size(32)));

__v32qi
_mm256_set_epi8  (char __q31, char __q30, char __q29, char __q28,
                  char __q27, char __q26, char __q25, char __q24,
                  char __q23, char __q22, char __q21, char __q20,
                  char __q19, char __q18, char __q17, char __q16,
                  char __q15, char __q14, char __q13, char __q12,
                  char __q11, char __q10, char __q09, char __q08,
                  char __q07, char __q06, char __q05, char __q04,
                  char __q03, char __q02, char __q01, char __q00)
{
  return __extension__ (__v32qi){
    __q00, __q01, __q02, __q03, __q04, __q05, __q06, __q07,
    __q08, __q09, __q10, __q11, __q12, __q13, __q14, __q15,
    __q16, __q17, __q18, __q19, __q20, __q21, __q22, __q23,
    __q24, __q25, __q26, __q27, __q28, __q29, __q30, __q31
  };
}


* [Bug target/82731] _mm256_set_epi8(array[offset[0]], array[offset[1]], ...) byte gather makes slow code, trying to zero-extend all the uint16_t offsets first and spilling them.
       [not found] <bug-82731-4@http.gcc.gnu.org/bugzilla/>
                   ` (4 preceding siblings ...)
  2024-04-17  9:45 ` rguenth at gcc dot gnu.org
@ 2024-04-17 10:36 ` rguenth at gcc dot gnu.org
  2024-04-17 10:59 ` liuhongt at gcc dot gnu.org
  6 siblings, 0 replies; 7+ messages in thread
From: rguenth at gcc dot gnu.org @ 2024-04-17 10:36 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82731

--- Comment #6 from Richard Biener <rguenth at gcc dot gnu.org> ---
That's ix86_expand_vector_init_interleave which for QI inner_mode extends
to SImode, likely because it tries to work with just SSE2?


* [Bug target/82731] _mm256_set_epi8(array[offset[0]], array[offset[1]], ...) byte gather makes slow code, trying to zero-extend all the uint16_t offsets first and spilling them.
       [not found] <bug-82731-4@http.gcc.gnu.org/bugzilla/>
                   ` (5 preceding siblings ...)
  2024-04-17 10:36 ` rguenth at gcc dot gnu.org
@ 2024-04-17 10:59 ` liuhongt at gcc dot gnu.org
  6 siblings, 0 replies; 7+ messages in thread
From: liuhongt at gcc dot gnu.org @ 2024-04-17 10:59 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82731

--- Comment #7 from Hongtao Liu <liuhongt at gcc dot gnu.org> ---
(In reply to Hongtao Liu from comment #4)
> (In reply to Hongtao Liu from comment #3)
> > Looks like ix86_vect_estimate_reg_pressure doesn't work here, taking a look.
> 
> Oh, ix86_vect_estimate_reg_pressure is only used for loops; the BB vectorizer
> only uses ix86_builtin_vectorization_cost, not add_stmt_cost/finish_cost.

Oh, the CTOR comes from the source code, not from the vectorizer.
Then why are the loads from offset not moved to just before their consumers
(the loads from array)? That would shorten the live ranges of those values.
(The loads from array are already moved to just before the CTOR insns.)
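
To make the ordering concrete, here is a rough source-level sketch of the
schedule described above, where each offset is consumed right after it is
loaded so only one offset is live at a time. `gather16_paired` is a
hypothetical helper, not code from GCC or from the bug report, and the
compiler remains free to unroll and reorder it, so this illustrates the
desired order rather than a guaranteed workaround.

```c
#include <string.h>

#define vect16 __attribute__((vector_size(16)))

/* Hypothetical helper: pair each offset load with its dependent byte load,
   then build the vector from the collected bytes.  */
vect16 char gather16_paired(const char *array, const unsigned short *offset) {
  char lanes[16];

  for (int i = 0; i < 16; i++) {
    unsigned short off = offset[i];  /* load one offset ...               */
    lanes[i] = array[off];           /* ... and use it straight away      */
  }

  vect16 char v;
  memcpy(&v, lanes, sizeof v);       /* copy the 16 bytes into the vector */
  return v;
}
```

Whether this actually avoids the spills depends on the optimization level and
target; comparing its assembly against the original CTOR form is the only way
to tell.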

