[Bug c++/107432] New: __builtin_convertvector generates inefficient code

* [Bug c++/107432] New: __builtin_convertvector generates inefficient code
@ 2022-10-27 10:02 g.peterhoff@t-online.de
  2022-10-27 15:12 ` [Bug target/107432] " pinskia at gcc dot gnu.org
                   ` (8 more replies)
  0 siblings, 9 replies; 10+ messages in thread
From: g.peterhoff@t-online.de @ 2022-10-27 10:02 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107432

            Bug ID: 107432
           Summary: __builtin_convertvector generates inefficient code
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: g.peterhoff@t-online.de
  Target Milestone: ---

Example: conversion int64_t -> int32_t

avx512f + avx512vl
HW conversions are available.

avx2
There is a correctly working 32-bit-permutation
(_mm256_permutevar8x32_epi32/vpermd) that can be used.
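
To illustrate why a 32-bit permutation is enough for this narrowing, here is a
scalar model (not the intrinsic itself, and not taken from the Godbolt link).
It assumes a little-endian target such as x86, where the low half of each
64-bit element sits in an even-numbered 32-bit lane. The function names are
hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reference: truncate each 64-bit element to 32 bits.  The out-of-range
   conversion is implementation-defined in C; GCC wraps modulo 2^32. */
void trunc_reference(const int64_t src[4], int32_t dst[4])
{
    for (int i = 0; i < 4; ++i)
        dst[i] = (int32_t)src[i];
}

/* Scalar model of the permutation idea: view the 256-bit input as eight
   32-bit lanes and gather lanes 0, 2, 4 and 6.  On a little-endian target
   those are the low halves of the four 64-bit elements. */
void trunc_by_lane_select(const int64_t src[4], int32_t dst[4])
{
    static const int index[4] = { 0, 2, 4, 6 };
    int32_t lanes[8];

    memcpy(lanes, src, sizeof lanes);
    for (int i = 0; i < 4; ++i)
        dst[i] = lanes[index[i]];
}

/* Self-check on a sample input; aborts on mismatch, returns 1 otherwise. */
int lane_select_matches_reference(void)
{
    const int64_t in[4] = { 1, -2, INT64_C(0x100000003), INT64_MIN };
    int32_t a[4], b[4];

    trunc_reference(in, a);
    trunc_by_lane_select(in, b);
    for (int i = 0; i < 4; ++i)
        assert(a[i] == b[i]);
    return 1;
}
```

A vpermd-based sequence would perform the same lane selection in a single
shuffle and then keep the low 128 bits of the result.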

I have not (yet) evaluated whether other conversions (larger int -> smaller
int) are also affected.
PS: On x86 it's already hell to optimize all cases depending on the instruction
set.
PPS: What about -march=znver4?

https://godbolt.org/z/3s79bnh7v

thx
Gero


* [Bug target/107432] __builtin_convertvector generates inefficient code
  2022-10-27 10:02 [Bug c++/107432] New: __builtin_convertvector generates inefficient code g.peterhoff@t-online.de
@ 2022-10-27 15:12 ` pinskia at gcc dot gnu.org
  2022-10-27 16:14 ` g.peterhoff@t-online.de
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: pinskia at gcc dot gnu.org @ 2022-10-27 15:12 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107432

--- Comment #1 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Created attachment 53781
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=53781&action=edit
testcase


* [Bug target/107432] __builtin_convertvector generates inefficient code
  2022-10-27 10:02 [Bug c++/107432] New: __builtin_convertvector generates inefficient code g.peterhoff@t-online.de
  2022-10-27 15:12 ` [Bug target/107432] " pinskia at gcc dot gnu.org
@ 2022-10-27 16:14 ` g.peterhoff@t-online.de
  2022-10-28  3:33 ` crazylht at gmail dot com
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: g.peterhoff@t-online.de @ 2022-10-27 16:14 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107432

--- Comment #2 from g.peterhoff@t-online.de ---
Another example. I want to convert an array<Bool> to array<Float64>.
There are basically 3 options:
- Copy
- Test (b2f64_default)
- optimized version (b2f64_manually)

gcc 12.2 and gcc trunk:
- convertSIZE_copy generates only scalar code (_mm_cvtsi64_sd)
- convertSIZE_default always generates conditional jumps

convertSIZE_manually, gcc trunk:
- always generates branch-free scalar code

convertSIZE_manually, gcc 12.2:
- convert1024_manually generates vector code, but does not use the HW
  conversion int8->int64 (_mm(256)_cvtepi8_epi64); instead it converts
  int8->int16->int32->int64 step by step
- convert8_manually generates branch-free scalar code
- convert4_manually generates vector code and uses the HW conversion
  int8->int64


NONE of these conversions is transformed/optimized to the point where, in
every case,
- all available intrinsics are used
- no "normal" registers are used
- branch-free code is generated
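
For readers without access to the Godbolt link, the three options can be
sketched roughly as follows. The function names and bodies are assumptions
made for illustration; they are not the reporter's b2f64_default or
b2f64_manually. The branch-free variant additionally assumes IEEE 754
binary64, where 1.0 has the bit pattern 0x3FF0000000000000.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* "Copy": rely on the implicit bool -> double conversion. */
void b2f64_copy(const bool *src, double *dst, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = src[i];
}

/* "Test": select the result with a condition. */
void b2f64_select(const bool *src, double *dst, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = src[i] ? 1.0 : 0.0;
}

/* "Optimized": build the bit pattern of 1.0 or 0.0 without a branch. */
void b2f64_bits(const bool *src, double *dst, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        uint64_t mask = (uint64_t)0 - (uint64_t)src[i]; /* all ones if true */
        uint64_t bits = mask & UINT64_C(0x3FF0000000000000);
        memcpy(&dst[i], &bits, sizeof bits);
    }
}

/* Self-check on a sample input; aborts on mismatch, returns 1 otherwise. */
int b2f64_variants_agree(void)
{
    const bool in[8] = { true, false, false, true, true, true, false, false };
    double a[8], b[8], c[8];

    b2f64_copy(in, a, 8);
    b2f64_select(in, b, 8);
    b2f64_bits(in, c, 8);
    for (int i = 0; i < 8; ++i) {
        assert(a[i] == (in[i] ? 1.0 : 0.0));
        assert(a[i] == b[i]);
        assert(a[i] == c[i]);
    }
    return 1;
}
```

All three compute the same values, so any difference in the generated code
comes from how the compiler handles each formulation.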

https://godbolt.org/z/f74vK79of

thx
Gero


* [Bug target/107432] __builtin_convertvector generates inefficient code
  2022-10-27 10:02 [Bug c++/107432] New: __builtin_convertvector generates inefficient code g.peterhoff@t-online.de
  2022-10-27 15:12 ` [Bug target/107432] " pinskia at gcc dot gnu.org
  2022-10-27 16:14 ` g.peterhoff@t-online.de
@ 2022-10-28  3:33 ` crazylht at gmail dot com
  2022-10-28  3:36 ` crazylht at gmail dot com
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: crazylht at gmail dot com @ 2022-10-28  3:33 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107432

--- Comment #3 from Hongtao.liu <crazylht at gmail dot com> ---
typedef int v4si __attribute__((vector_size(16)));
typedef long long v4di __attribute__((vector_size(32)));

v4si
foo (v4di a)
{
    return __builtin_convertvector (a, v4si);
}

Hmm, we actually support truncv4div4si2, but somehow GCC fails to generate
.VEC_CONVERT with truncmn2.

Hmm, what is the optab for convert_optab_handler?


* [Bug target/107432] __builtin_convertvector generates inefficient code
  2022-10-27 10:02 [Bug c++/107432] New: __builtin_convertvector generates inefficient code g.peterhoff@t-online.de
                   ` (2 preceding siblings ...)
  2022-10-28  3:33 ` crazylht at gmail dot com
@ 2022-10-28  3:36 ` crazylht at gmail dot com
  2022-10-28  5:22 ` crazylht at gmail dot com
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: crazylht at gmail dot com @ 2022-10-28  3:36 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107432

--- Comment #4 from Hongtao.liu <crazylht at gmail dot com> ---
(In reply to Hongtao.liu from comment #3)
> typedef int v4si __attribute__((vector_size(16)));
> typedef long long v4di __attribute__((vector_size(32)));
> 
> v4si
> foo (v4di a)
> {
>     return __builtin_convertvector (a, v4si);
> }
> 
> Hmm, we actually support truncv4div4si2, but somehow GCC fails to generate
> .VEC_CONVERT with truncmn2.
> 

/* IFN_VEC_CONVERT is supposed to be expanded at pass_lower_vector.  So this
   dummy function should never be called.  */

static void
expand_VEC_CONVERT (internal_fn, gcall *)
{
  gcc_unreachable ();
}

It's lowered by pass_lower_vector. Ideally, could we use truncmn2 in
expand_VEC_CONVERT if src is a wider integer mode than dest?


* [Bug target/107432] __builtin_convertvector generates inefficient code
  2022-10-27 10:02 [Bug c++/107432] New: __builtin_convertvector generates inefficient code g.peterhoff@t-online.de
                   ` (3 preceding siblings ...)
  2022-10-28  3:36 ` crazylht at gmail dot com
@ 2022-10-28  5:22 ` crazylht at gmail dot com
  2022-10-28  5:33 ` crazylht at gmail dot com
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: crazylht at gmail dot com @ 2022-10-28  5:22 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107432

--- Comment #5 from Hongtao.liu <crazylht at gmail dot com> ---

> It's lowered by pass_lower_vector. Ideally, could we use truncmn2 in
> expand_VEC_CONVERT if src is a wider integer mode than dest?

Currently, expand_vector_conversion uses VEC_PACK_TRUNC_EXPR:

---------------cut begins------------------------
  else if (modifier == NARROW)
    {
      switch (code)
        {
        CASE_CONVERT:
          code1 = VEC_PACK_TRUNC_EXPR;
          optab1 = optab_for_tree_code (code1, arg_type, optab_default);
          break;

---------------cut ends------------------------

But the BB vectorizer can do the right thing for

void
foo (long long* a, int* b)
{
    b[0] = a[0];
    b[1] = a[1];
    b[2] = a[2];
    b[3] = a[3];
}



        vmovdqu ymm0, YMMWORD PTR [rdi]
        vpmovqd XMMWORD PTR [rsi], ymm0
        vzeroupper
        ret


  vect__1.5_16 = MEM <vector(4) long long int> [(long long int *)a_10(D)];
  vect__2.6_18 = (vector(4) int) vect__1.5_16;
  # DEBUG BEGIN_STMT
  # DEBUG BEGIN_STMT
  # DEBUG BEGIN_STMT
  MEM <vector(4) int> [(int *)b_11(D)] = vect__2.6_18;
  return;


Guess expand_vector_conversion can be optimized.
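
To compare the two code paths side by side, the explicit conversion and the
scalar form can be placed in one file. This is only a sketch: the function
names are hypothetical, and the vector version goes through pointers and
memcpy so that no 256-bit vector is passed by value.

```c
#include <assert.h>
#include <string.h>

typedef int v4si __attribute__((vector_size(16)));
typedef long long v4di __attribute__((vector_size(32)));

/* Explicit vector conversion, lowered via expand_vector_conversion. */
void narrow_builtin(const long long *a, int *b)
{
    v4di in;
    v4si out;

    memcpy(&in, a, sizeof in);
    out = __builtin_convertvector(in, v4si);
    memcpy(b, &out, sizeof out);
}

/* Scalar form, left to the BB vectorizer. */
void narrow_scalar(const long long *a, int *b)
{
    b[0] = a[0];
    b[1] = a[1];
    b[2] = a[2];
    b[3] = a[3];
}

/* Self-check on a sample input; aborts on mismatch, returns 1 otherwise. */
int narrow_variants_agree(void)
{
    const long long in[4] = { 1, -2, 0x100000003LL, 42 };
    int a[4], b[4];

    narrow_builtin(in, a);
    narrow_scalar(in, b);
    for (int i = 0; i < 4; ++i)
        assert(a[i] == b[i]);
    return 1;
}
```

Both functions compute the same result, so compiling them with, for example,
-O2 -mavx512f -mavx512vl and comparing the assembly shows whether the two
paths end up with the same instruction sequence.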


* [Bug target/107432] __builtin_convertvector generates inefficient code
  2022-10-27 10:02 [Bug c++/107432] New: __builtin_convertvector generates inefficient code g.peterhoff@t-online.de
                   ` (4 preceding siblings ...)
  2022-10-28  5:22 ` crazylht at gmail dot com
@ 2022-10-28  5:33 ` crazylht at gmail dot com
  2022-10-28  6:55 ` crazylht at gmail dot com
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: crazylht at gmail dot com @ 2022-10-28  5:33 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107432

--- Comment #6 from Hongtao.liu <crazylht at gmail dot com> ---

> Guess expand_vector_conversion can be optimized.

  if (INTEGRAL_TYPE_P (TREE_TYPE (ret_type))
      && SCALAR_FLOAT_TYPE_P (TREE_TYPE (arg_type)))
    code = FIX_TRUNC_EXPR;
  else if (INTEGRAL_TYPE_P (TREE_TYPE (arg_type))
           && SCALAR_FLOAT_TYPE_P (TREE_TYPE (ret_type)))
    code = FLOAT_EXPR;

It only supports floatmn2/fix_truncmn2 for float <-> integer.

But we could also support extendmn2/zero_extendmn2/truncmn2 for float <-> float
and integer <-> integer.

Or are there any concerns, i.e. are VEC_PACK_TRUNC_EXPR,
VEC_PACK_FIX_TRUNC_EXPR and VEC_PACK_FLOAT_EXPR used on purpose?
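
The mapping discussed here can be summarized as a small lookup. The sketch
below is a standalone model, not GCC code: the enum, struct and function are
hypothetical, and it only reports which standard pattern name describes a
conversion, not whether a given target implements it. The unsigned variants
fixuns_truncmn2 and floatunsmn2 are not mentioned in this comment and are
included only for completeness.

```c
#include <assert.h>
#include <string.h>

enum elem_kind { ELEM_SINT, ELEM_UINT, ELEM_FLOAT };

struct elem_type {
    enum elem_kind kind;
    unsigned bits;      /* element width in bits */
};

/* Return the standard pattern name that describes converting elements of
   type src to elements of type dst. */
const char *conversion_pattern(struct elem_type src, struct elem_type dst)
{
    int src_fp = src.kind == ELEM_FLOAT;
    int dst_fp = dst.kind == ELEM_FLOAT;

    if (src_fp && !dst_fp)      /* float -> integer */
        return dst.kind == ELEM_UINT ? "fixuns_truncmn2" : "fix_truncmn2";
    if (!src_fp && dst_fp)      /* integer -> float */
        return src.kind == ELEM_UINT ? "floatunsmn2" : "floatmn2";

    /* float <-> float or integer <-> integer: only the width changes. */
    if (dst.bits < src.bits)
        return "truncmn2";
    if (dst.bits > src.bits)
        return src.kind == ELEM_UINT ? "zero_extendmn2" : "extendmn2";
    return "none";              /* same width: no conversion pattern needed */
}

/* Self-check for the cases named in the thread; returns 1 on success. */
int conversion_pattern_examples(void)
{
    struct elem_type s64 = { ELEM_SINT, 64 }, s32 = { ELEM_SINT, 32 };
    struct elem_type u8 = { ELEM_UINT, 8 }, f64 = { ELEM_FLOAT, 64 };
    struct elem_type f32 = { ELEM_FLOAT, 32 };

    assert(strcmp(conversion_pattern(s64, s32), "truncmn2") == 0);
    assert(strcmp(conversion_pattern(s32, s64), "extendmn2") == 0);
    assert(strcmp(conversion_pattern(u8, s64), "zero_extendmn2") == 0);
    assert(strcmp(conversion_pattern(f32, f64), "extendmn2") == 0);
    assert(strcmp(conversion_pattern(f64, s32), "fix_truncmn2") == 0);
    assert(strcmp(conversion_pattern(s32, f64), "floatmn2") == 0);
    return 1;
}
```

The first case, v4di to v4si, is the one from the original report and maps to
truncmn2 (truncv4div4si2 on x86).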


* [Bug target/107432] __builtin_convertvector generates inefficient code
  2022-10-27 10:02 [Bug c++/107432] New: __builtin_convertvector generates inefficient code g.peterhoff@t-online.de
                   ` (5 preceding siblings ...)
  2022-10-28  5:33 ` crazylht at gmail dot com
@ 2022-10-28  6:55 ` crazylht at gmail dot com
  2022-10-28 11:41 ` rguenth at gcc dot gnu.org
  2022-10-31 13:02 ` rsandifo at gcc dot gnu.org
  8 siblings, 0 replies; 10+ messages in thread
From: crazylht at gmail dot com @ 2022-10-28  6:55 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107432

--- Comment #7 from Hongtao.liu <crazylht at gmail dot com> ---
(In reply to Hongtao.liu from comment #6)
> > Guess expand_vector_conversion can be optimized.
> 
>   if (INTEGRAL_TYPE_P (TREE_TYPE (ret_type))
>       && SCALAR_FLOAT_TYPE_P (TREE_TYPE (arg_type)))
>     code = FIX_TRUNC_EXPR;
>   else if (INTEGRAL_TYPE_P (TREE_TYPE (arg_type))
> 	   && SCALAR_FLOAT_TYPE_P (TREE_TYPE (ret_type)))
>     code = FLOAT_EXPR;
> 
> It only supports floatmn2/fix_truncmn2 for float <-> integer.
> 
> But we could also support extendmn2/zero_extendmn2/truncmn2 for float <->
> float and integer <-> integer.
> 
> Or are there any concerns, i.e. are VEC_PACK_TRUNC_EXPR,
> VEC_PACK_FIX_TRUNC_EXPR and VEC_PACK_FLOAT_EXPR used on purpose?

Maybe we can add some GIMPLE simplification in match.pd to handle
  _4 = VEC_PACK_TRUNC_EXPR <a_1(D), { 0, 0, 0, 0 }>;
  _5 = BIT_FIELD_REF <_4, 128, 0>;

and

  _4 = [vec_unpack_lo_expr] a_1(D);
  _5 = [vec_unpack_hi_expr] a_1(D);
  _2 = {_4, _5};

since the loop vectorizer may also create vec_unpack_lo_expr/vec_unpack_hi_expr.
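
The reason these two sequences could be folded into plain conversions can be
shown with a scalar model. The helpers below are hypothetical stand-ins for
the GIMPLE operations, and they assume the little-endian element order in
which "lo" means the first elements of the vector.

```c
#include <assert.h>
#include <stdint.h>

/* Model of VEC_PACK_TRUNC_EXPR <a, b>: truncate every element of both
   inputs and concatenate the results, a's elements first. */
void pack_trunc(const int64_t a[4], const int64_t b[4], int32_t out[8])
{
    for (int i = 0; i < 4; ++i) {
        out[i]     = (int32_t)a[i];
        out[i + 4] = (int32_t)b[i];
    }
}

/* Model of vec_unpack_lo_expr / vec_unpack_hi_expr: sign-extend the first
   or the last four elements of an eight-element vector. */
void unpack_lo(const int32_t v[8], int64_t out[4])
{
    for (int i = 0; i < 4; ++i)
        out[i] = v[i];
}

void unpack_hi(const int32_t v[8], int64_t out[4])
{
    for (int i = 0; i < 4; ++i)
        out[i] = v[i + 4];
}

/* Checks both equivalences on sample data; aborts on mismatch. */
int patterns_match_plain_conversions(void)
{
    const int64_t a[4] = { 7, -1, INT64_C(0x200000005), INT64_MIN };
    const int64_t zero[4] = { 0, 0, 0, 0 };
    const int32_t v[8] = { 1, -2, 3, -4, INT32_MAX, INT32_MIN, 0, 9 };
    int32_t packed[8];
    int64_t lo[4], hi[4];

    /* Pattern 1: the low 128 bits of the packed vector (its first four
       32-bit elements) equal a plain element-wise truncation of a. */
    pack_trunc(a, zero, packed);
    for (int i = 0; i < 4; ++i)
        assert(packed[i] == (int32_t)a[i]);

    /* Pattern 2: { unpack_lo (v), unpack_hi (v) } equals a plain
       element-wise widening of v. */
    unpack_lo(v, lo);
    unpack_hi(v, hi);
    for (int i = 0; i < 4; ++i) {
        assert(lo[i] == (int64_t)v[i]);
        assert(hi[i] == (int64_t)v[i + 4]);
    }
    return 1;
}
```

In both cases the multi-statement sequence carries no more information than a
single element-wise conversion, which is what a match.pd rule would exploit.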


* [Bug target/107432] __builtin_convertvector generates inefficient code
  2022-10-27 10:02 [Bug c++/107432] New: __builtin_convertvector generates inefficient code g.peterhoff@t-online.de
                   ` (6 preceding siblings ...)
  2022-10-28  6:55 ` crazylht at gmail dot com
@ 2022-10-28 11:41 ` rguenth at gcc dot gnu.org
  2022-10-31 13:02 ` rsandifo at gcc dot gnu.org
  8 siblings, 0 replies; 10+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-10-28 11:41 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107432

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rguenth at gcc dot gnu.org,
                   |                            |rsandifo at gcc dot gnu.org
             Target|X86_64                      |x86_64-*-*
   Last reconfirmed|                            |2022-10-28
             Status|UNCONFIRMED                 |NEW
     Ever confirmed|0                           |1
            Version|unknown                     |13.0

--- Comment #8 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Hongtao.liu from comment #6)
> > Guess expand_vector_conversion can be optimized.
> 
>   if (INTEGRAL_TYPE_P (TREE_TYPE (ret_type))
>       && SCALAR_FLOAT_TYPE_P (TREE_TYPE (arg_type)))
>     code = FIX_TRUNC_EXPR;
>   else if (INTEGRAL_TYPE_P (TREE_TYPE (arg_type))
> 	   && SCALAR_FLOAT_TYPE_P (TREE_TYPE (ret_type)))
>     code = FLOAT_EXPR;
> 
> It only supports floatmn2/fix_truncmn2 for float <-> integer.
> 
> But we could also support extendmn2/zero_extendmn2/truncmn2 for float <->
> float and integer <-> integer.
> 
> Or are there any concerns, i.e. are VEC_PACK_TRUNC_EXPR,
> VEC_PACK_FIX_TRUNC_EXPR and VEC_PACK_FLOAT_EXPR used on purpose?

I think we do support FIX_TRUNC_EXPR or FLOAT_EXPR for float <-> int
conversion of vectors like we now support {CONVERT,NOP}_EXPR for
just widening/shortening.  At least the GIMPLE verifier allows that.

The optabs would be [us]fix and [us]float; not sure if aarch64 makes use
of those for vector modes or if Richard extended the vectorizer to
consider those (I only remember int <-> int conversions).

So I think if x86_64 can do float <-> int for vectors, implementing
[us]fix/[us]float would be the way to go (and of course then make use
of those in lowering/vectorization).


* [Bug target/107432] __builtin_convertvector generates inefficient code
  2022-10-27 10:02 [Bug c++/107432] New: __builtin_convertvector generates inefficient code g.peterhoff@t-online.de
                   ` (7 preceding siblings ...)
  2022-10-28 11:41 ` rguenth at gcc dot gnu.org
@ 2022-10-31 13:02 ` rsandifo at gcc dot gnu.org
  8 siblings, 0 replies; 10+ messages in thread
From: rsandifo at gcc dot gnu.org @ 2022-10-31 13:02 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107432

--- Comment #9 from rsandifo at gcc dot gnu.org <rsandifo at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #8)
> I think we do support FIX_TRUNC_EXPR or FLOAT_EXPR for float <-> int
> conversion of vectors like we now support {CONVERT,NOP}_EXPR for
> just widening/shortening.  At least the GIMPLE verifier allows that.
> 
> The optabs would be [us]fix and [us]float; not sure if aarch64 makes use
> of those for vector modes or if Richard extended the vectorizer to
> consider those (I only remember int <-> int conversions).
AArch64 doesn't use mixed-size vector fix and float yet, but the hope
is that it will in future.  For SVE, the main difficulty is that FP
conversions could raise exceptions, so only the conditional forms
would be interesting for normal predicated loops under default flags.
The unpredicated optabs would require -ffast-math-like flags.

This is probably lower-hanging fruit for Advanced SIMD though.
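
The point about conditional forms can be modelled in scalar code. In a
predicated loop the inactive lanes may hold arbitrary values, and an
unconditional float -> int conversion of such a lane could raise a spurious
FP exception. The sketch below is a hypothetical model, not SVE code: it
converts only the active lanes and leaves the others untouched.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Conditional form: convert only the lanes whose predicate bit is set.
   Inactive lanes are never converted, so their contents cannot raise an
   FP exception. */
void cond_fix_trunc(const bool *active, const double *src, int *dst, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        if (active[i])
            dst[i] = (int)src[i];
}

/* Self-check; aborts on mismatch, returns 1 otherwise. */
int inactive_lanes_untouched(void)
{
    const bool active[4] = { true, true, false, false };
    /* The last two values stand for out-of-range data in inactive lanes. */
    const double src[4] = { 1.5, -2.5, 1e300, 1e300 };
    int dst[4] = { 0, 0, -1, -1 };

    cond_fix_trunc(active, src, dst, 4);
    assert(dst[0] == 1 && dst[1] == -2);    /* active lanes converted */
    assert(dst[2] == -1 && dst[3] == -1);   /* inactive lanes unchanged */
    return 1;
}
```

An unpredicated form would convert all four lanes, including the two
out-of-range ones, which is why it would need -ffast-math-like flags.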

