* [Bug target/110202] _mm512_ternarylogic_epi64 generates unnecessary operations
2023-06-10 10:37 [Bug c++/110202] New: _mm512_ternarylogic_epi64 generates unnecessary operations fabio at cannizzo dot net
@ 2023-06-10 16:07 ` pinskia at gcc dot gnu.org
2023-06-10 16:14 ` [Bug rtl-optimization/110202] " pinskia at gcc dot gnu.org
` (10 subsequent siblings)
11 siblings, 0 replies; 13+ messages in thread
From: pinskia at gcc dot gnu.org @ 2023-06-10 16:07 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110202
--- Comment #1 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Created attachment 55299
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=55299&action=edit
Corrected testcase
^ permalink raw reply [flat|nested] 13+ messages in thread
* [Bug rtl-optimization/110202] _mm512_ternarylogic_epi64 generates unnecessary operations
2023-06-10 10:37 [Bug c++/110202] New: _mm512_ternarylogic_epi64 generates unnecessary operations fabio at cannizzo dot net
2023-06-10 16:07 ` [Bug target/110202] " pinskia at gcc dot gnu.org
@ 2023-06-10 16:14 ` pinskia at gcc dot gnu.org
2023-06-10 18:36 ` jakub at gcc dot gnu.org
` (9 subsequent siblings)
11 siblings, 0 replies; 13+ messages in thread
From: pinskia at gcc dot gnu.org @ 2023-06-10 16:14 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110202
Andrew Pinski <pinskia at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Component|target |rtl-optimization
Severity|normal |enhancement
--- Comment #2 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Note you get a warning in your negate1 case
<source>: In function '__m512i negate1(const __m512i*)':
<source>:7:36: warning: 'res' is used uninitialized [-Wuninitialized]
7 | res = _mm512_ternarylogic_epi64(res, res, *a, 0x55);
| ~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~
<source>:6:13: note: 'res' was declared here
6 | __m512i res;
| ^~~
But even doing this:
__m512i negate1(const __m512i *a)
{
__m512i res = _mm512_undefined_si512 ();
res = _mm512_ternarylogic_epi64(res, res, *a, 0x55);
return res;
}
Will cause an extra zeroing.
^ permalink raw reply [flat|nested] 13+ messages in thread
* [Bug rtl-optimization/110202] _mm512_ternarylogic_epi64 generates unnecessary operations
2023-06-10 10:37 [Bug c++/110202] New: _mm512_ternarylogic_epi64 generates unnecessary operations fabio at cannizzo dot net
2023-06-10 16:07 ` [Bug target/110202] " pinskia at gcc dot gnu.org
2023-06-10 16:14 ` [Bug rtl-optimization/110202] " pinskia at gcc dot gnu.org
@ 2023-06-10 18:36 ` jakub at gcc dot gnu.org
2023-06-10 21:10 ` pinskia at gcc dot gnu.org
` (8 subsequent siblings)
11 siblings, 0 replies; 13+ messages in thread
From: jakub at gcc dot gnu.org @ 2023-06-10 18:36 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110202
Jakub Jelinek <jakub at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |hjl.tools at gmail dot com,
| |jakub at gcc dot gnu.org
--- Comment #3 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Well, there is nothing magic on exactly 0x55 immediate, there are 256 possible
immediates, most of them use all of A, B, C, some of them use just A, B, others
just B, C, others just A, C, others just A, others just B, others just C,
others none of them.
And I must say I don't immediately see easy rules how to find out from the
immediate value which set is which, so unless we find some easy rule for that,
we'd need to hardcode the mapping between the 256 values to a bitmask which
inputs are actually used.
And then the question is how to represent that in RTL to make it clear that
some operands are mentioned but their value isn't really used.
^ permalink raw reply [flat|nested] 13+ messages in thread
* [Bug rtl-optimization/110202] _mm512_ternarylogic_epi64 generates unnecessary operations
2023-06-10 10:37 [Bug c++/110202] New: _mm512_ternarylogic_epi64 generates unnecessary operations fabio at cannizzo dot net
` (2 preceding siblings ...)
2023-06-10 18:36 ` jakub at gcc dot gnu.org
@ 2023-06-10 21:10 ` pinskia at gcc dot gnu.org
2023-06-12 17:14 ` fabio at cannizzo dot net
` (7 subsequent siblings)
11 siblings, 0 replies; 13+ messages in thread
From: pinskia at gcc dot gnu.org @ 2023-06-10 21:10 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110202
Andrew Pinski <pinskia at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Last reconfirmed| |2023-06-10
Ever confirmed|0 |1
Status|UNCONFIRMED |NEW
--- Comment #4 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
(In reply to Jakub Jelinek from comment #3)
> Well, there is nothing magic on exactly 0x55 immediate, there are 256
> possible immediates, most of them use all of A, B, C, some of them use just
> A, B, others just B, C, others just A, C, others just A, others just B,
> others just C, others none of them.
> And I must say I don't immediately see easy rules how to find out from the
> immediate value which set is which, so unless we find some easy rule for
> that, we'd need to hardcode the mapping between the 256 values to a bitmask
> which inputs are actually used.
> And then the question is how to represent that in RTL to make it clear that
> some operands are mentioned but their value isn't really used.
In the case of 0x55, an idea might be to split (or expand) it into how ~ is
represented.
That is:
(insn:TI 6 3 12 2 (set (reg:V8DI 20 xmm0 [85])
(xor:V8DI (mem:V8DI (reg/v/f:DI 5 di [orig:84 a ] [84]) [0 *a_3(D)+0
S64 A512])
(const_vector:V8DI [
(const_int -1 [0xffffffffffffffff]) repeated x8
]))) "/app/example.cpp":21:14 6764 {*one_cmplv8di2}
(expr_list:REG_DEAD (reg/v/f:DI 5 di [orig:84 a ] [84])
(nil)))
^ permalink raw reply [flat|nested] 13+ messages in thread
* [Bug rtl-optimization/110202] _mm512_ternarylogic_epi64 generates unnecessary operations
2023-06-10 10:37 [Bug c++/110202] New: _mm512_ternarylogic_epi64 generates unnecessary operations fabio at cannizzo dot net
` (3 preceding siblings ...)
2023-06-10 21:10 ` pinskia at gcc dot gnu.org
@ 2023-06-12 17:14 ` fabio at cannizzo dot net
2023-06-12 19:04 ` amonakov at gcc dot gnu.org
` (6 subsequent siblings)
11 siblings, 0 replies; 13+ messages in thread
From: fabio at cannizzo dot net @ 2023-06-12 17:14 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110202
--- Comment #5 from Fabio Cannizzo <fabio at cannizzo dot net> ---
> Well, there is nothing magic on exactly 0x55 immediate, there are 256
> possible immediates, most of them use all of A, B, C, some of them use just
> A, B, others just B, C, others just A, C, others just A, others just B,
> others just C, others none of them.
Indeed I meant 0x55 just as an example.
^ permalink raw reply [flat|nested] 13+ messages in thread
* [Bug rtl-optimization/110202] _mm512_ternarylogic_epi64 generates unnecessary operations
2023-06-10 10:37 [Bug c++/110202] New: _mm512_ternarylogic_epi64 generates unnecessary operations fabio at cannizzo dot net
` (4 preceding siblings ...)
2023-06-12 17:14 ` fabio at cannizzo dot net
@ 2023-06-12 19:04 ` amonakov at gcc dot gnu.org
2023-06-27 17:59 ` amonakov at gcc dot gnu.org
` (5 subsequent siblings)
11 siblings, 0 replies; 13+ messages in thread
From: amonakov at gcc dot gnu.org @ 2023-06-12 19:04 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110202
Alexander Monakov <amonakov at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |amonakov at gcc dot gnu.org
--- Comment #6 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
(In reply to Jakub Jelinek from comment #3)
> And I must say I don't immediately see easy rules how to find out from the
> immediate value which set is which, so unless we find some easy rule for
> that, we'd need to hardcode the mapping between the 256 values to a bitmask
> which inputs are actually used.
Well, that's really easy. The immediate is just a eight-entry look-up table
from any possible input bit triple to the output bit. The leftmost operand
corresponds to the most significant bit in the triple, so to check if the
operation vpternlog(A, B, C, I) is invariant w.r.t A you check if nibbles of I
are equal. Here we have 0x55, equal nibbles, and the operation is invariant
w.r.t A.
Similarly, to check if it's invariant w.r.t B we check if two-bit groups in I
come in pairs, or in code: (I & 0x33) == ((I >> 2) & 0x33). For 0x55 both sides
evaluate to 0x11, so again, invariant w.r.t B.
Finally, checking invariantness w.r.t C is (I & 0x55) == ((I >> 1) & 0x55).
^ permalink raw reply [flat|nested] 13+ messages in thread
* [Bug rtl-optimization/110202] _mm512_ternarylogic_epi64 generates unnecessary operations
2023-06-10 10:37 [Bug c++/110202] New: _mm512_ternarylogic_epi64 generates unnecessary operations fabio at cannizzo dot net
` (5 preceding siblings ...)
2023-06-12 19:04 ` amonakov at gcc dot gnu.org
@ 2023-06-27 17:59 ` amonakov at gcc dot gnu.org
2023-06-28 0:47 ` crazylht at gmail dot com
` (4 subsequent siblings)
11 siblings, 0 replies; 13+ messages in thread
From: amonakov at gcc dot gnu.org @ 2023-06-27 17:59 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110202
--- Comment #7 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
Note that vpxor serves as a dependency-breaking instruction (see PR 110438). So
in negate1 we do the right thing for the wrong reasons, and in negate2 we can
cause a substantial stall if the previous computation of xmm0 has a non-trivial
dependency chain.
^ permalink raw reply [flat|nested] 13+ messages in thread
* [Bug rtl-optimization/110202] _mm512_ternarylogic_epi64 generates unnecessary operations
2023-06-10 10:37 [Bug c++/110202] New: _mm512_ternarylogic_epi64 generates unnecessary operations fabio at cannizzo dot net
` (6 preceding siblings ...)
2023-06-27 17:59 ` amonakov at gcc dot gnu.org
@ 2023-06-28 0:47 ` crazylht at gmail dot com
2023-06-28 5:07 ` amonakov at gcc dot gnu.org
` (3 subsequent siblings)
11 siblings, 0 replies; 13+ messages in thread
From: crazylht at gmail dot com @ 2023-06-28 0:47 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110202
--- Comment #8 from Hongtao.liu <crazylht at gmail dot com> ---
(In reply to Alexander Monakov from comment #7)
> Note that vpxor serves as a dependency-breaking instruction (see PR 110438).
> So in negate1 we do the right thing for the wrong reasons, and in negate2 we
> can cause a substantial stall if the previous computation of xmm0 has a
> non-trivial dependency chain.
For this one, we can load *a into %zmm0 to avoid false_dependence.
vmovdqau ZMMWORD PTR [rdi], zmm0
vpternlogq zmm0, zmm0, zmm0, 85
^ permalink raw reply [flat|nested] 13+ messages in thread
* [Bug rtl-optimization/110202] _mm512_ternarylogic_epi64 generates unnecessary operations
2023-06-10 10:37 [Bug c++/110202] New: _mm512_ternarylogic_epi64 generates unnecessary operations fabio at cannizzo dot net
` (7 preceding siblings ...)
2023-06-28 0:47 ` crazylht at gmail dot com
@ 2023-06-28 5:07 ` amonakov at gcc dot gnu.org
2023-07-12 7:51 ` cvs-commit at gcc dot gnu.org
` (2 subsequent siblings)
11 siblings, 0 replies; 13+ messages in thread
From: amonakov at gcc dot gnu.org @ 2023-06-28 5:07 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110202
--- Comment #9 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
(In reply to Hongtao.liu from comment #8)
>
> For this one, we can load *a into %zmm0 to avoid false_dependence.
>
> vmovdqau ZMMWORD PTR [rdi], zmm0
> vpternlogq zmm0, zmm0, zmm0, 85
Yes, since ternlog with memory operand needs two fused-domain uops on Intel
CPUs, breaking out the load would be more efficient for both negate1 and
negate2.
^ permalink raw reply [flat|nested] 13+ messages in thread
* [Bug rtl-optimization/110202] _mm512_ternarylogic_epi64 generates unnecessary operations
2023-06-10 10:37 [Bug c++/110202] New: _mm512_ternarylogic_epi64 generates unnecessary operations fabio at cannizzo dot net
` (8 preceding siblings ...)
2023-06-28 5:07 ` amonakov at gcc dot gnu.org
@ 2023-07-12 7:51 ` cvs-commit at gcc dot gnu.org
2023-08-04 16:44 ` [Bug target/110202] " cvs-commit at gcc dot gnu.org
2023-08-05 15:32 ` amonakov at gcc dot gnu.org
11 siblings, 0 replies; 13+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2023-07-12 7:51 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110202
--- Comment #10 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by hongtao Liu <liuhongt@gcc.gnu.org>:
https://gcc.gnu.org/g:13c556d6ae84be3ee2bc245a56eafa58221de86a
commit r14-2447-g13c556d6ae84be3ee2bc245a56eafa58221de86a
Author: liuhongt <hongtao.liu@intel.com>
Date: Thu Jun 29 14:25:28 2023 +0800
Break false dependence for vpternlog by inserting vpxor or setting
constraint of input operand to '0'
False dependency happens when destination is only updated by
pternlog. There is no false dependency when destination is also used
in source. So either a pxor should be inserted, or input operand
should be set with constraint '0'.
gcc/ChangeLog:
PR target/110438
PR target/110202
* config/i386/predicates.md
(int_float_vector_all_ones_operand): New predicate.
* config/i386/sse.md (*vmov<mode>_constm1_pternlog_false_dep): New
define_insn.
(*<avx512>_cvtmask2<ssemodesuffix><mode>_pternlog_false_dep):
Ditto.
(*<avx512>_cvtmask2<ssemodesuffix><mode>_pternlog_false_dep):
Ditto.
(*<avx512>_cvtmask2<ssemodesuffix><mode>): Adjust to
define_insn_and_split to avoid false dependence.
(*<avx512>_cvtmask2<ssemodesuffix><mode>): Ditto.
(<mask_codefor>one_cmpl<mode>2<mask_name>): Adjust constraint
of operands 1 to '0' to avoid false dependence.
(*andnot<mode>3): Ditto.
(iornot<mode>3): Ditto.
(*<nlogic><mode>3): Ditto.
gcc/testsuite/ChangeLog:
* gcc.target/i386/pr110438.c: New test.
* gcc.target/i386/pr100711-6.c: Adjust testcase.
^ permalink raw reply [flat|nested] 13+ messages in thread
* [Bug target/110202] _mm512_ternarylogic_epi64 generates unnecessary operations
2023-06-10 10:37 [Bug c++/110202] New: _mm512_ternarylogic_epi64 generates unnecessary operations fabio at cannizzo dot net
` (9 preceding siblings ...)
2023-07-12 7:51 ` cvs-commit at gcc dot gnu.org
@ 2023-08-04 16:44 ` cvs-commit at gcc dot gnu.org
2023-08-05 15:32 ` amonakov at gcc dot gnu.org
11 siblings, 0 replies; 13+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2023-08-04 16:44 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110202
--- Comment #11 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Alexander Monakov <amonakov@gcc.gnu.org>:
https://gcc.gnu.org/g:567d06bb357a39ece865cef67ada44124f227e45
commit r14-2999-g567d06bb357a39ece865cef67ada44124f227e45
Author: Yan Simonaytes <simonaytes.yan@ispras.ru>
Date: Tue Jul 25 20:43:19 2023 +0300
i386: eliminate redundant operands of VPTERNLOG
As mentioned in PR 110202, GCC may be presented with input where control
word of the VPTERNLOG intrinsic implies that some of its operands do not
affect the result. In that case, we can eliminate redundant operands
of the instruction by substituting any other operand in their place.
This removes false dependencies.
For instance, instead of (252 = 0xfc = _MM_TERNLOG_A | _MM_TERNLOG_B)
vpternlogq $252, %zmm2, %zmm1, %zmm0
emit
vpternlogq $252, %zmm0, %zmm1, %zmm0
When VPTERNLOG is invariant w.r.t first and second operands, and the
third operand is memory, load memory into the output operand first, i.e.
instead of (85 = 0x55 = ~_MM_TERNLOG_C)
vpternlogq $85, (%rdi), %zmm1, %zmm0
emit
vmovdqa64 (%rdi), %zmm0
vpternlogq $85, %zmm0, %zmm0, %zmm0
gcc/ChangeLog:
PR target/110202
* config/i386/i386-protos.h
(vpternlog_redundant_operand_mask): Declare.
(substitute_vpternlog_operands): Declare.
* config/i386/i386.cc
(vpternlog_redundant_operand_mask): New helper.
(substitute_vpternlog_operands): New function. Use them...
* config/i386/sse.md: ... here in new VPTERNLOG define_splits.
gcc/testsuite/ChangeLog:
PR target/110202
* gcc.target/i386/invariant-ternlog-1.c: New test.
* gcc.target/i386/invariant-ternlog-2.c: New test.
^ permalink raw reply [flat|nested] 13+ messages in thread
* [Bug target/110202] _mm512_ternarylogic_epi64 generates unnecessary operations
2023-06-10 10:37 [Bug c++/110202] New: _mm512_ternarylogic_epi64 generates unnecessary operations fabio at cannizzo dot net
` (10 preceding siblings ...)
2023-08-04 16:44 ` [Bug target/110202] " cvs-commit at gcc dot gnu.org
@ 2023-08-05 15:32 ` amonakov at gcc dot gnu.org
11 siblings, 0 replies; 13+ messages in thread
From: amonakov at gcc dot gnu.org @ 2023-08-05 15:32 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110202
Alexander Monakov <amonakov at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|NEW |RESOLVED
Resolution|--- |FIXED
--- Comment #12 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
We now generate
negate1:
vmovdqa64 zmm0, ZMMWORD PTR [rdi]
vpternlogq zmm0, zmm0, zmm0, 85
ret
negate2:
vmovdqa32 zmm0, ZMMWORD PTR [rdi]
vpternlogd zmm0, zmm0, zmm0, 0x55
ret
Fixed for gcc-14.
^ permalink raw reply [flat|nested] 13+ messages in thread