* [Bug middle-end/92080] Missed CSE of _mm512_set1_epi8(c) with _mm256_set1_epi8(c)
From: pinskia at gcc dot gnu.org @ 2021-09-04 22:17 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92080
Andrew Pinski <pinskia at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Last reconfirmed|2019-10-14 00:00:00 |2021-9-4
Severity|normal |enhancement
--- Comment #5 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
This gives good code:
#include <immintrin.h>
__m512i sinkz;
__m256i sinky;
void foo(char c) {
__m512i a = _mm512_set1_epi8(c);
sinkz = a;
sinky = *((__m256i*)&a);
}
* [Bug rtl-optimization/92080] Missed CSE of _mm512_set1_epi8(c) with _mm256_set1_epi8(c)
From: rguenth at gcc dot gnu.org @ 2023-06-13 7:43 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92080
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |crazylht at gmail dot com
Blocks| |53947
Component|middle-end |rtl-optimization
--- Comment #6 from Richard Biener <rguenth at gcc dot gnu.org> ---
Similar when vectorizing
int a[4096];
void foo ()
{
for (int i = 1; i < 4095; ++i)
a[i] = 42;
}
the combination of peeling for alignment and the epilog yields on GIMPLE:
<bb 2> [local count: 10737416]:
MEM <vector(8) int> [(int *)&a + 4B] = { 42, 42, 42, 42, 42, 42, 42, 42 };
MEM <vector(4) int> [(int *)&a + 36B] = { 42, 42, 42, 42 };
MEM <vector(2) int> [(int *)&a + 52B] = { 42, 42 };
a[15] = 42;
ivtmp.28_59 = (unsigned long) &MEM <int[4096]> [(void *)&a + 64B];
_1 = (unsigned long) &a;
_182 = _1 + 16320;
<bb 3> [local count: 75161909]:
# ivtmp.28_71 = PHI <ivtmp.28_65(3), ivtmp.28_59(2)>
_21 = (void *) ivtmp.28_71;
MEM <vector(16) int> [(int *)_21] = { 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42 };
ivtmp.28_65 = ivtmp.28_71 + 64;
if (ivtmp.28_65 != _182)
goto <bb 3>; [85.71%]
else
goto <bb 4>; [14.29%]
<bb 4> [local count: 21474835]:
MEM <vector(8) int> [(int *)&a + 16320B] = { 42, 42, 42, 42, 42, 42, 42, 42 };
MEM <vector(4) int> [(int *)&a + 16352B] = { 42, 42, 42, 42 };
MEM <vector(2) int> [(int *)&a + 16368B] = { 42, 42 };
a[4094] = 42;
return;
and that in turn causes a lot of redundant broadcasts from constants (via
GPRs):
foo:
.LFB0:
.cfi_startproc
movl $42, %eax
movq .LC2(%rip), %rcx
movl $42, %edx
movl $42, a+60(%rip)
vpbroadcastd %eax, %ymm0
vmovdqu %ymm0, a+4(%rip)
vpbroadcastd %eax, %xmm0
movl $a+64, %eax
vmovdqu %xmm0, a+36(%rip)
vpbroadcastd %edx, %zmm0
movq %rcx, a+52(%rip)
.L2:
vmovdqa32 %zmm0, (%rax)
subq $-128, %rax
vmovdqa32 %zmm0, -64(%rax)
cmpq $a+16320, %rax
jne .L2
vpbroadcastd %edx, %ymm0
movq %rcx, a+16368(%rip)
movl $42, a+16376(%rip)
vmovdqa %ymm0, a+16320(%rip)
vpbroadcastd %edx, %xmm0
vmovdqa %xmm0, a+16352(%rip)
vzeroupper
ret
As they are constants on GIMPLE, any "CSE" we'd perform there would quickly be
undone by constant propagation. So it's only on RTL, where the actual
broadcast is a non-constant operation, that we can and should optimize this
somehow. Some kind of LCM (lazy code motion) that also handles an earlier
small broadcast followed by a later, bigger one would be necessary here.
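A hand-written C model of the store pattern such an optimization would aim for, offered only as a sketch: the widest splat is materialized once and every narrower store reuses its low bytes, following the offsets in the GIMPLE dump above. The function name `fill_reusing_widest_splat` and the use of `memcpy` for the partial stores are illustrative assumptions, not something GCC emits today.

```c
#include <assert.h>
#include <string.h>

/* 64-byte vector of 16 ints (GCC/Clang vector extension).  */
typedef int v16si __attribute__((vector_size(64)));

int a[4096];

/* Hypothetical model: one wide splat, all narrower stores take its low part.  */
void fill_reusing_widest_splat(void)
{
    const v16si splat = { 42, 42, 42, 42, 42, 42, 42, 42,
                          42, 42, 42, 42, 42, 42, 42, 42 };
    assert(sizeof splat == 16 * sizeof(int));

    /* Prologue (peeling for alignment): a[1..15] as 8 + 4 + 2 + 1 ints.  */
    memcpy(&a[1],  &splat, 8 * sizeof(int));
    memcpy(&a[9],  &splat, 4 * sizeof(int));
    memcpy(&a[13], &splat, 2 * sizeof(int));
    memcpy(&a[15], &splat, 1 * sizeof(int));

    /* Main loop: full 16-int stores covering a[16..4079].  */
    for (int i = 16; i < 4080; i += 16)
        memcpy(&a[i], &splat, sizeof splat);

    /* Epilogue: a[4080..4094] as 8 + 4 + 2 + 1 ints.  */
    memcpy(&a[4080], &splat, 8 * sizeof(int));
    memcpy(&a[4088], &splat, 4 * sizeof(int));
    memcpy(&a[4092], &splat, 2 * sizeof(int));
    memcpy(&a[4094], &splat, 1 * sizeof(int));
}
```

Every store reads from the same `splat` object, which is the source-level analogue of keeping one zmm broadcast live and storing its ymm/xmm low parts instead of re-broadcasting.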
Referenced Bugs:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations
* [Bug rtl-optimization/92080] Missed CSE of _mm512_set1_epi8(c) with _mm256_set1_epi8(c)
From: liuhongt at gcc dot gnu.org @ 2024-03-21 7:13 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92080
Hongtao Liu <liuhongt at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |liuhongt at gcc dot gnu.org
--- Comment #7 from Hongtao Liu <liuhongt at gcc dot gnu.org> ---
Another simple case is
typedef int v4si __attribute__((vector_size(16)));
typedef short v8hi __attribute__((vector_size(16)));
v8hi a;
v4si b;
void
foo ()
{
b = __extension__(v4si){0, 0, 0, 0};
a = __extension__(v8hi){0, 0, 0, 0, 0, 0, 0, 0};
}
GCC generates 2 pxor
foo():
vpxor xmm0, xmm0, xmm0
vmovdqa XMMWORD PTR b[rip], xmm0
vpxor xmm0, xmm0, xmm0
vmovdqa XMMWORD PTR a[rip], xmm0
ret
* [Bug rtl-optimization/92080] Missed CSE of _mm512_set1_epi8(c) with _mm256_set1_epi8(c)
From: rguenther at suse dot de @ 2024-03-21 7:51 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92080
--- Comment #8 from rguenther at suse dot de <rguenther at suse dot de> ---
On Thu, 21 Mar 2024, liuhongt at gcc dot gnu.org wrote:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92080
>
> Hongtao Liu <liuhongt at gcc dot gnu.org> changed:
>
> What |Removed |Added
> ----------------------------------------------------------------------------
> CC| |liuhongt at gcc dot gnu.org
>
> --- Comment #7 from Hongtao Liu <liuhongt at gcc dot gnu.org> ---
> Another simple case is
>
> typedef int v4si __attribute__((vector_size(16)));
> typedef short v8hi __attribute__((vector_size(16)));
>
> v8hi a;
> v4si b;
> void
> foo ()
> {
> b = __extension__(v4si){0, 0, 0, 0};
> a = __extension__(v8hi){0, 0, 0, 0, 0, 0, 0, 0};
> }
>
> GCC generates 2 pxor
>
> foo():
> vpxor xmm0, xmm0, xmm0
> vmovdqa XMMWORD PTR b[rip], xmm0
> vpxor xmm0, xmm0, xmm0
> vmovdqa XMMWORD PTR a[rip], xmm0
> ret
If we were to expose that vpxor before postreload we'd likely CSE but
we have
5: xmm0:V4SI=const_vector
REG_EQUIV const_vector
6: [`b']=xmm0:V4SI
7: xmm0:V8HI=const_vector
REG_EQUIV const_vector
8: [`a']=xmm0:V8HI
until the very end. But since we have the same mode size on the xmm0
sets CSE could easily handle (integral) constants by hashing/comparing
on their byte representation rather than by using the RTX structure.
OTOH as we mostly have special constants allowed in the IL like this
treating all-zeros and all-ones specially might be good enough ...
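To illustrate what comparing on the byte representation buys, here is a small C sketch using the vector types from the testcase. The helper `same_constant_bytes` and the driver `check_examples` are hypothetical names for illustration only; this models the idea at the source level and is not how cse.cc is written.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

typedef int   v4si __attribute__((vector_size(16)));
typedef short v8hi __attribute__((vector_size(16)));

/* Hypothetical helper: two constants match when they have the same size
   and the same bytes, whatever their element type is.  */
bool same_constant_bytes(const void *x, size_t xsize,
                         const void *y, size_t ysize)
{
    return xsize == ysize && memcmp(x, y, xsize) == 0;
}

void check_examples(void)
{
    /* All-zeros: structurally different vectors, identical bytes.  */
    v4si zero32 = { 0, 0, 0, 0 };
    v8hi zero16 = { 0, 0, 0, 0, 0, 0, 0, 0 };
    assert(same_constant_bytes(&zero32, sizeof zero32, &zero16, sizeof zero16));

    /* All-ones: also identical bytes.  */
    v4si ones32 = { -1, -1, -1, -1 };
    v8hi ones16 = { -1, -1, -1, -1, -1, -1, -1, -1 };
    assert(same_constant_bytes(&ones32, sizeof ones32, &ones16, sizeof ones16));

    /* A splat of 1 differs: the element width changes the byte pattern.  */
    v4si one32 = { 1, 1, 1, 1 };
    v8hi one16 = { 1, 1, 1, 1, 1, 1, 1, 1 };
    assert(!same_constant_bytes(&one32, sizeof one32, &one16, sizeof one16));
}
```

The last case shows why a byte-level key is more general than special-casing all-zeros and all-ones, while also showing that those two special constants are the ones that match across every element width.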
* [Bug rtl-optimization/92080] Missed CSE of _mm512_set1_epi8(c) with _mm256_set1_epi8(c)
From: liuhongt at gcc dot gnu.org @ 2024-03-21 8:03 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92080
--- Comment #9 from Hongtao Liu <liuhongt at gcc dot gnu.org> ---
> If we were to expose that vpxor before postreload we'd likely CSE but
> we have
>
> 5: xmm0:V4SI=const_vector
> REG_EQUIV const_vector
> 6: [`b']=xmm0:V4SI
> 7: xmm0:V8HI=const_vector
> REG_EQUIV const_vector
> 8: [`a']=xmm0:V8HI
>
> until the very end. But since we have the same mode size on the xmm0
> sets CSE could easily handle (integral) constants by hashing/comparing
> on their byte representation rather than by using the RTX structure.
> OTOH as we mostly have special constants allowed in the IL like this
> treating all-zeros and all-ones specially might be good enough ...
CSE currently only handles the scalar case (see the code below). I guess we
could do something similar for vectors, maybe:
1. iterate over vector modes with the same vector length?
2. iterate over vector modes with the same component mode but a bigger vector
length?
But that would miss the v8hi/v8si pxor case. Another alternative is to
canonicalize const_vector to a scalar mode, i.e. v4si -> TI, v8si -> OI,
v16si -> XI; then we can just query with TI/OI/XImode?
/* See if we have a CONST_INT that is already in a register in a
   wider mode. */

if (src_const && src_related == 0 && CONST_INT_P (src_const)
    && is_int_mode (mode, &int_mode)
    && GET_MODE_PRECISION (int_mode) < BITS_PER_WORD)
  {
    opt_scalar_int_mode wider_mode_iter;
    FOR_EACH_WIDER_MODE (wider_mode_iter, int_mode)
      {
        scalar_int_mode wider_mode = wider_mode_iter.require ();
        if (GET_MODE_PRECISION (wider_mode) > BITS_PER_WORD)
          break;

        struct table_elt *const_elt
          = lookup (src_const, HASH (src_const, wider_mode), wider_mode);

        if (const_elt == 0)
          continue;

        for (const_elt = const_elt->first_same_value;
             const_elt; const_elt = const_elt->next_same_value)
          if (REG_P (const_elt->exp))
            {
              src_related = gen_lowpart (int_mode, const_elt->exp);
              break;
            }

        if (src_related != 0)
          break;
      }
  }
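A toy C model of the "canonicalize to a scalar mode, then query" idea may make the proposal concrete. It keys recorded constants purely on their size and byte image, so a 16-byte v8hi zero can be found in a register that already holds a 32-byte v8si zero. Everything here (`struct const_table`, `record_constant`, `find_reusable_reg`, the register numbers) is hypothetical and only loosely mirrors the wider-mode lookup quoted above; it does not use any GCC internals.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { MAX_BYTES = 64, MAX_ENTRIES = 8 };

/* One recorded "register REG holds this SIZE-byte constant" fact.  */
struct const_entry {
    int reg;
    size_t size;
    unsigned char bytes[MAX_BYTES];
};

struct const_table {
    struct const_entry e[MAX_ENTRIES];
    size_t n;
};

/* Record that register REG holds the constant with the given byte image.  */
void record_constant(struct const_table *t, int reg,
                     const void *bytes, size_t size)
{
    assert(size <= MAX_BYTES && t->n < MAX_ENTRIES);
    struct const_entry *e = &t->e[t->n++];
    e->reg = reg;
    e->size = size;
    memcpy(e->bytes, bytes, size);
}

/* Return a register whose constant is at least SIZE bytes wide and whose
   first SIZE bytes equal BYTES (the analogue of taking a lowpart of a
   wider mode), or -1 if there is none.  */
int find_reusable_reg(const struct const_table *t,
                      const void *bytes, size_t size)
{
    for (size_t i = 0; i < t->n; ++i)
        if (t->e[i].size >= size && memcmp(t->e[i].bytes, bytes, size) == 0)
            return t->e[i].reg;
    return -1;
}
```

Because the key ignores the element type, the lookup covers both same-size cases (v4si vs. v8hi) and wider-register cases (v8hi vs. v8si) with one query, which is the gap the two iteration-based options leave open.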
* [Bug rtl-optimization/92080] Missed CSE of _mm512_set1_epi8(c) with _mm256_set1_epi8(c)
From: rguenth at gcc dot gnu.org @ 2024-03-21 8:31 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92080
--- Comment #10 from Richard Biener <rguenth at gcc dot gnu.org> ---
But it's even simpler than the cited case - the mode has the same size (for the
latest testcase, not for the original one, of course).
Also, after reload, zeroing a V4SImode register zeroes the whole ymm register
as well, but setting V4SImode to all-ones of course does not set the upper
half of ymm to all-ones; it "zero-extends" instead.
With CSE it then becomes important which set comes first. If the larger-mode
set comes first it's easier. If the smaller-mode set comes first you'd
have to change it to the larger one (if the zero-extension is not what you
want).
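The ordering issue can be sketched with a byte-level model of a ymm register in which a 128-bit set clears the upper half, as described above. The type `struct ymm_model` and the three helpers are hypothetical names used for illustration; the model only captures the zero-extension behaviour and nothing else about the hardware or about CSE.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

enum { XMM_BYTES = 16, YMM_BYTES = 32 };

/* Byte image of one ymm register; bytes 0..15 are the xmm low half.  */
struct ymm_model {
    unsigned char b[YMM_BYTES];
};

/* Model of a 128-bit splat: low half gets FILL, upper half is zeroed.  */
void set_xmm_splat(struct ymm_model *r, unsigned char fill)
{
    memset(r->b, fill, XMM_BYTES);
    memset(r->b + XMM_BYTES, 0, YMM_BYTES - XMM_BYTES);
}

/* Model of a 256-bit splat: all 32 bytes get FILL.  */
void set_ymm_splat(struct ymm_model *r, unsigned char fill)
{
    memset(r->b, fill, YMM_BYTES);
}

/* True if the low WIDTH bytes of R all equal FILL, i.e. the register can
   stand in for a WIDTH-byte splat of FILL.  */
bool holds_splat(const struct ymm_model *r, unsigned char fill, size_t width)
{
    assert(width <= YMM_BYTES);
    for (size_t i = 0; i < width; ++i)
        if (r->b[i] != fill)
            return false;
    return true;
}
```

In this model a wide set followed by a narrow use always works, a narrow zeroing also satisfies a later wide zero, but a narrow all-ones set does not satisfy a later wide all-ones use, which is the case where the earlier set would have to be widened.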
* [Bug rtl-optimization/92080] Missed CSE of _mm512_set1_epi8(c) with _mm256_set1_epi8(c)
From: pinskia at gcc dot gnu.org @ 2024-03-21 8:43 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92080
--- Comment #11 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #6)
> Similar when vectorizing
>
> int a[4096];
>
> void foo ()
> {
> for (int i = 1; i < 4095; ++i)
> a[i] = 42;
> }
This was actually reported by me in PR 99639 but for aarch64.