public inbox for gcc-bugs@sourceware.org
* [Bug c++/107432] New: __builtin_convertvector generates inefficient code
@ 2022-10-27 10:02 g.peterhoff@t-online.de
2022-10-27 15:12 ` [Bug target/107432] " pinskia at gcc dot gnu.org
` (8 more replies)
0 siblings, 9 replies; 10+ messages in thread
From: g.peterhoff@t-online.de @ 2022-10-27 10:02 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107432
Bug ID: 107432
Summary: __builtin_convertvector generates inefficient code
Product: gcc
Version: unknown
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c++
Assignee: unassigned at gcc dot gnu.org
Reporter: g.peterhoff@t-online.de
Target Milestone: ---
Example: conversion int64_t -> int32_t
- avx512f + avx512vl: HW conversions are available.
- avx2: There is a correctly working 32-bit permutation
  (_mm256_permutevar8x32_epi32/vpermd) that can be used.
I have not (yet) evaluated whether other conversions (larger int -> smaller
int) are also affected.
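For readers who cannot open the Compiler Explorer link in this report, the case described above can be sketched roughly as follows. This is only an illustration, not the reporter's testcase: the type names `v4di`/`v4si` and the function `narrow4` are assumptions made for the example, and the vectors go through memory so that the snippet does not depend on the vector calling convention.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical type names: 4 x int64_t (256 bits) and 4 x int32_t (128 bits). */
typedef int64_t v4di __attribute__((vector_size(32)));
typedef int32_t v4si __attribute__((vector_size(16)));

/* Narrow four 64-bit integers to 32 bits with __builtin_convertvector.
   Each lane is converted like a C cast from int64_t to int32_t. */
void narrow4(const int64_t *src, int32_t *dst)
{
    v4di in;
    v4si out;

    memcpy(&in, src, sizeof in);              /* load 4 x int64_t */
    out = __builtin_convertvector(in, v4si);  /* lane-wise narrowing */
    memcpy(dst, &out, sizeof out);            /* store 4 x int32_t */
}
```

With avx512f + avx512vl one would hope for a single hardware narrowing instruction here, and with avx2 for a vpermd-based sequence; what a given GCC version actually emits has to be checked on that compiler.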
PS: On x86 it's already hell to optimize all cases depending on the instruction
set.
PPS: What about -march=znver4 ?
https://godbolt.org/z/3s79bnh7v
thx
Gero
* [Bug target/107432] __builtin_convertvector generates inefficient code
From: pinskia at gcc dot gnu.org @ 2022-10-27 15:12 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107432
--- Comment #1 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Created attachment 53781
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=53781&action=edit
testcase
* [Bug target/107432] __builtin_convertvector generates inefficient code
From: g.peterhoff@t-online.de @ 2022-10-27 16:14 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107432
--- Comment #2 from g.peterhoff@t-online.de ---
Another example. I want to convert an array<Bool> to array<Float64>.
There are basically 3 options:
- Copy
- Test (b2f64_default)
- optimized version (b2f64_manually)
gcc 12.2 + gcc trunk:
- convertSIZE_copy only generates scalar code (_mm_cvtsi64_sd)
- convertSIZE_default always generates conditional jumps
convertSIZE_manually:
- gcc trunk always generates branch-free scalar code
- gcc 12.2:
  - convert1024_manually generates vector code, but does not use the HW
    conversion int8->int64 (_mm(256)_cvtepi8_epi64) and instead converts
    int8->int16->int32->int64 manually
  - convert8_manually generates branch-free scalar code
  - convert4_manually generates vector code and uses the HW conversion
    int8->int64
NONE of these conversions is optimized to the point where, in every case,
- all available intrinsics are used
- no "normal" registers are used
- branch-free code is generated
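The three options listed above can be sketched in plain C as follows. This is not the code behind the Compiler Explorer link: the function names `b2f64_copy`, `b2f64_test` and `b2f64_vector` and the vector typedefs are hypothetical, and the vector variant assumes that `bool` occupies one byte holding 0 or 1.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Assumption of this sketch: bool is a single byte holding 0 or 1. */
_Static_assert(sizeof(bool) == 1, "sketch assumes 1-byte bool");

/* Hypothetical vector types: 4 x unsigned char and 4 x double. */
typedef unsigned char v4qu __attribute__((vector_size(4)));
typedef double        v4df __attribute__((vector_size(32)));

/* "Copy": rely on the implicit bool -> double conversion. */
void b2f64_copy(const bool *src, double *dst, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = src[i];
}

/* "Test": pick the value with a condition, which may become a branch. */
void b2f64_test(const bool *src, double *dst, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = src[i] ? 1.0 : 0.0;
}

/* "Manual": convert four bytes at a time with __builtin_convertvector,
   then finish the remainder with scalar code. */
void b2f64_vector(const bool *src, double *dst, size_t n)
{
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        v4qu in;
        v4df out;

        memcpy(&in, src + i, sizeof in);
        out = __builtin_convertvector(in, v4df);
        memcpy(dst + i, &out, sizeof out);
    }
    for (; i < n; ++i)
        dst[i] = src[i];
}
```

All three functions compute the same result; the point of the report is only how differently the compiler translates them.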
https://godbolt.org/z/f74vK79of
thx
Gero
* [Bug target/107432] __builtin_convertvector generates inefficient code
From: crazylht at gmail dot com @ 2022-10-28 3:33 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107432
--- Comment #3 from Hongtao.liu <crazylht at gmail dot com> ---
typedef int v4si __attribute__((vector_size(16)));
typedef long long v4di __attribute__((vector_size(32)));

v4si
foo (v4di a)
{
  return __builtin_convertvector (a, v4si);
}

Hmm, we actually support truncv4div4si2, but somehow GCC fails to generate
.VEC_CONVERT with truncmn2.
Hmm, what's the optab for convert_optab_handler?
* [Bug target/107432] __builtin_convertvector generates inefficient code
From: crazylht at gmail dot com @ 2022-10-28 3:36 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107432
--- Comment #4 from Hongtao.liu <crazylht at gmail dot com> ---
(In reply to Hongtao.liu from comment #3)
> typedef int v4si __attribute__((vector_size(16)));
> typedef long long v4di __attribute__((vector_size(32)));
>
> v4si
> foo (v4di a)
> {
> return __builtin_convertvector (a, v4si);
> }
>
> hmm, we actually support truncv4div4si2, but some how gcc failed to generate
> .VEC_CONVERT with truncmn2.
>
/* IFN_VEC_CONVERT is supposed to be expanded at pass_lower_vector.  So this
   dummy function should never be called.  */

static void
expand_VEC_CONVERT (internal_fn, gcall *)
{
  gcc_unreachable ();
}

It's lowered by pass_lower_vector. Ideally, could we use truncmn2 in
expand_VEC_CONVERT when src has a wider integer mode than dest?
* [Bug target/107432] __builtin_convertvector generates inefficient code
From: crazylht at gmail dot com @ 2022-10-28 5:22 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107432
--- Comment #5 from Hongtao.liu <crazylht at gmail dot com> ---
> It's lowered by pass_lower_vector, ideally, can we use truncmn2 in
> expand_VEC_CONVERT if src is bigger integer mode than dest.
Currently, expand_vector_conversion uses VEC_PACK_TRUNC_EXPR:
---------------cut begins------------------------
  else if (modifier == NARROW)
    {
      switch (code)
        {
        CASE_CONVERT:
          code1 = VEC_PACK_TRUNC_EXPR;
          optab1 = optab_for_tree_code (code1, arg_type, optab_default);
          break;
---------------cut ends------------------------
But the BB vectorizer does the right thing for

void
foo (long long* a, int* b)
{
  b[0] = a[0];
  b[1] = a[1];
  b[2] = a[2];
  b[3] = a[3];
}

Generated assembly:

        vmovdqu ymm0, YMMWORD PTR [rdi]
        vpmovqd XMMWORD PTR [rsi], ymm0
        vzeroupper
        ret

GIMPLE after vectorization:

  vect__1.5_16 = MEM <vector(4) long long int> [(long long int *)a_10(D)];
  vect__2.6_18 = (vector(4) int) vect__1.5_16;
  # DEBUG BEGIN_STMT
  # DEBUG BEGIN_STMT
  # DEBUG BEGIN_STMT
  MEM <vector(4) int> [(int *)b_11(D)] = vect__2.6_18;
  return;

I guess expand_vector_conversion can be optimized.
* [Bug target/107432] __builtin_convertvector generates inefficient code
From: crazylht at gmail dot com @ 2022-10-28 5:33 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107432
--- Comment #6 from Hongtao.liu <crazylht at gmail dot com> ---
> Guess expand_vector_conversion can be optimized.
  if (INTEGRAL_TYPE_P (TREE_TYPE (ret_type))
      && SCALAR_FLOAT_TYPE_P (TREE_TYPE (arg_type)))
    code = FIX_TRUNC_EXPR;
  else if (INTEGRAL_TYPE_P (TREE_TYPE (arg_type))
           && SCALAR_FLOAT_TYPE_P (TREE_TYPE (ret_type)))
    code = FLOAT_EXPR;

It only supports floatmn2/fix_truncmn2 for float <-> integer.
But we could also support extendmn2/zero_extendmn2/truncmn2 for float <-> float
and integer <-> integer.
Or are there any concerns, and are VEC_PACK_TRUNC_EXPR,
VEC_PACK_FIX_TRUNC_EXPR and VEC_PACK_FLOAT_EXPR used on purpose?
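To make the conversion categories above concrete, here is a small user-level sketch with one __builtin_convertvector call per category, annotated with the optab family named in this comment. The typedefs, the `DEFINE_CONVERT` macro and the `cvt_*` names are hypothetical, and the optab annotations only restate the mapping suggested above; which pattern a particular target would end up using is not something the sketch can show.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical 4-lane vector types used only for this illustration. */
typedef int32_t  v4si  __attribute__((vector_size(16)));
typedef int64_t  v4di  __attribute__((vector_size(32)));
typedef uint32_t v4usi __attribute__((vector_size(16)));
typedef uint64_t v4udi __attribute__((vector_size(32)));
typedef float    v4sf  __attribute__((vector_size(16)));
typedef double   v4df  __attribute__((vector_size(32)));

/* Define a converter that loads 4 lanes, converts them lane-wise and
   stores the result.  Going through memory avoids any dependence on
   the vector calling convention. */
#define DEFINE_CONVERT(name, from_vec, from_elt, to_vec, to_elt) \
    void name(const from_elt *src, to_elt *dst)                  \
    {                                                            \
        from_vec in;                                             \
        to_vec out;                                              \
        memcpy(&in, src, sizeof in);                             \
        out = __builtin_convertvector(in, to_vec);               \
        memcpy(dst, &out, sizeof out);                           \
    }

/* float <-> integer: the case handled today. */
DEFINE_CONVERT(cvt_i32_to_f32, v4si, int32_t, v4sf, float)      /* floatmn2 */
DEFINE_CONVERT(cvt_f32_to_i32, v4sf, float, v4si, int32_t)      /* fix_truncmn2 */

/* integer <-> integer: the proposed addition. */
DEFINE_CONVERT(cvt_i64_to_i32, v4di, int64_t, v4si, int32_t)    /* truncmn2 */
DEFINE_CONVERT(cvt_i32_to_i64, v4si, int32_t, v4di, int64_t)    /* extendmn2 */
DEFINE_CONVERT(cvt_u32_to_u64, v4usi, uint32_t, v4udi, uint64_t) /* zero_extendmn2 */

/* float <-> float: the proposed addition. */
DEFINE_CONVERT(cvt_f64_to_f32, v4df, double, v4sf, float)       /* truncmn2 */
DEFINE_CONVERT(cvt_f32_to_f64, v4sf, float, v4df, double)       /* extendmn2 */
```

Each converter behaves like the corresponding scalar C cast applied to every lane, so float to integer conversion truncates toward zero and unsigned widening zero-extends.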
* [Bug target/107432] __builtin_convertvector generates inefficient code
From: crazylht at gmail dot com @ 2022-10-28 6:55 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107432
--- Comment #7 from Hongtao.liu <crazylht at gmail dot com> ---
(In reply to Hongtao.liu from comment #6)
> > Guess expand_vector_conversion can be optimized.
>
> if (INTEGRAL_TYPE_P (TREE_TYPE (ret_type))
> && SCALAR_FLOAT_TYPE_P (TREE_TYPE (arg_type)))
> code = FIX_TRUNC_EXPR;
> else if (INTEGRAL_TYPE_P (TREE_TYPE (arg_type))
> && SCALAR_FLOAT_TYPE_P (TREE_TYPE (ret_type)))
> code = FLOAT_EXPR;
>
> It only supports floatmn2/fix_truncmn2 for float <-> integer.
>
> But we can also supports extendmn2/zero_extendmn2/truncmn2 for float <->
> float, integer <-> integer.
>
> Or are there any concerns and VEC_PACK_TRUNC_EXPR,
> VEC_PACK_FIX_TRUNC_EXPR,VEC_PACK_FLOAT_EXPR are used on purpose?
Maybe we can add some GIMPLE simplification in match.pd to handle

  _4 = VEC_PACK_TRUNC_EXPR <a_1(D), { 0, 0, 0, 0 }>;
  _5 = BIT_FIELD_REF <_4, 128, 0>;

and

  _4 = [vec_unpack_lo_expr] a_1(D);
  _5 = [vec_unpack_hi_expr] a_1(D);
  _2 = {_4, _5};

since the loop vectorizer may also create vec_unpack_lo_expr/vec_unpack_hi_expr.
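Why such a simplification would be valid can be seen from a scalar model of the two patterns. The sketch below is not GCC code: the `model_*` functions are hypothetical, the lane counts (4 x 64-bit for the pack case, 8 x 32-bit for the unpack case) are assumed, and the lane order assumes a little-endian target where "lo" means the first lanes in memory order.

```c
#include <stdint.h>

/* Model of VEC_PACK_TRUNC_EXPR <a, b> on 4 x int64_t operands: truncate
   every lane and concatenate a's lanes followed by b's lanes. */
void model_vec_pack_trunc(const int64_t a[4], const int64_t b[4],
                          int32_t out[8])
{
    for (int i = 0; i < 4; ++i) {
        out[i]     = (int32_t)a[i];
        out[i + 4] = (int32_t)b[i];
    }
}

/* First pattern: pack with a zero vector, then keep the low 128 bits
   (BIT_FIELD_REF <_4, 128, 0>), i.e. the first four int32_t lanes. */
void model_pack_then_extract(const int64_t a[4], int32_t out[4])
{
    const int64_t zero[4] = {0, 0, 0, 0};
    int32_t packed[8];

    model_vec_pack_trunc(a, zero, packed);
    for (int i = 0; i < 4; ++i)
        out[i] = packed[i];
}

/* What the first pattern could be simplified to: one direct truncation. */
void model_direct_trunc(const int64_t a[4], int32_t out[4])
{
    for (int i = 0; i < 4; ++i)
        out[i] = (int32_t)a[i];
}

/* Second pattern: unpack the low and high halves separately, then
   concatenate the two widened halves. */
void model_unpack_lo_hi(const int32_t a[8], int64_t out[8])
{
    int64_t lo[4], hi[4];

    for (int i = 0; i < 4; ++i) {
        lo[i] = a[i];        /* vec_unpack_lo_expr */
        hi[i] = a[i + 4];    /* vec_unpack_hi_expr */
    }
    for (int i = 0; i < 4; ++i) {
        out[i]     = lo[i];
        out[i + 4] = hi[i];
    }
}

/* What the second pattern could be simplified to: one direct extension. */
void model_direct_extend(const int32_t a[8], int64_t out[8])
{
    for (int i = 0; i < 8; ++i)
        out[i] = a[i];
}
```

In both cases the multi-step form and the direct form yield the same lanes, which is what would allow a match.pd rule to rewrite one into the other.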
* [Bug target/107432] __builtin_convertvector generates inefficient code
From: rguenth at gcc dot gnu.org @ 2022-10-28 11:41 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107432
Richard Biener <rguenth at gcc dot gnu.org> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rguenth at gcc dot gnu.org,
                   |                            |rsandifo at gcc dot gnu.org
             Target|X86_64                      |x86_64-*-*
   Last reconfirmed|                            |2022-10-28
             Status|UNCONFIRMED                 |NEW
     Ever confirmed|0                           |1
            Version|unknown                     |13.0
--- Comment #8 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Hongtao.liu from comment #6)
> > Guess expand_vector_conversion can be optimized.
>
> if (INTEGRAL_TYPE_P (TREE_TYPE (ret_type))
> && SCALAR_FLOAT_TYPE_P (TREE_TYPE (arg_type)))
> code = FIX_TRUNC_EXPR;
> else if (INTEGRAL_TYPE_P (TREE_TYPE (arg_type))
> && SCALAR_FLOAT_TYPE_P (TREE_TYPE (ret_type)))
> code = FLOAT_EXPR;
>
> It only supports floatmn2/fix_truncmn2 for float <-> integer.
>
> But we can also supports extendmn2/zero_extendmn2/truncmn2 for float <->
> float, integer <-> integer.
>
> Or are there any concerns and VEC_PACK_TRUNC_EXPR,
> VEC_PACK_FIX_TRUNC_EXPR,VEC_PACK_FLOAT_EXPR are used on purpose?
I think we do support FIX_TRUNC_EXPR or FLOAT_EXPR for float <-> int
conversion of vectors, like we now support {CONVERT,NOP}_EXPR for
just widening/shortening. At least the GIMPLE verifier allows that.
The optabs would be [us]fix and [us]float; I'm not sure whether aarch64 makes
use of those for vector modes, or whether Richard extended the vectorizer to
consider those (I only remember int <-> int conversions).
So I think if x86_64 can do float <-> int for vectors, implementing
[us]fix/[us]float would be the way to go (and of course then making use
of those in lowering/vectorization).
* [Bug target/107432] __builtin_convertvector generates inefficient code
From: rsandifo at gcc dot gnu.org @ 2022-10-31 13:02 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107432
--- Comment #9 from rsandifo at gcc dot gnu.org <rsandifo at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #8)
> I think we do support FIX_TRUNC_EXPR or FLOAT_EXPR for float <-> int
> conversion of vectors like we now support {CONVERT,NOP}_EXPR for
> just widening/shortening. At least the GIMPLE verifier allows that.
>
> The obtabs would be [us]fix and [us]float, not sure if aarch64 makes use
> of those for vector modes or if Richard extended the vectorizer to
> consider those (I only remember int <-> int conversions).
AArch64 doesn't use mixed-size vector fix and float yet, but the hope
is that it will in future. For SVE, the main difficulty is that FP
conversions could raise exceptions, so only the conditional forms
would be interesting for normal predicated loops under default flags.
The unpredicated optabs would require -ffast-math-like flags.
This is probably lower hanging fruit for Advanced SIMD though.