public inbox for gcc-help@gcc.gnu.org
 help / color / mirror / Atom feed
* size value of vector_size attribute
@ 2019-12-16  6:14 Xi Ruoyao
  2019-12-16  7:19 ` Marc Glisse
  0 siblings, 1 reply; 5+ messages in thread
From: Xi Ruoyao @ 2019-12-16  6:14 UTC (permalink / raw)
  To: gcc-help

Hi,

Is there any reason to enforce "x must be a power of 2" in
__attribute__((vector_size(x)))?

I want to use this attribute in my source code to simplify coding
(not, normally, to utilize SIMD instructions).  Someone may argue
that I should use std::valarray, but it is stupidly slow.  Now with this
restriction on the size value I may have to write something like
std::valarray but without dynamic allocation.
-- 
Xi Ruoyao <xry111@mengyan1223.wang>
School of Aerospace Science and Technology, Xidian University

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: size value of vector_size attribute
  2019-12-16  6:14 size value of vector_size attribute Xi Ruoyao
@ 2019-12-16  7:19 ` Marc Glisse
  2019-12-16 13:16   ` Chris Elrod
  0 siblings, 1 reply; 5+ messages in thread
From: Marc Glisse @ 2019-12-16  7:19 UTC (permalink / raw)
  To: Xi Ruoyao; +Cc: gcc-help

On Mon, 16 Dec 2019, Xi Ruoyao wrote:

> Is there any reason to enforce "x must be a power of 2" in
> __attribute__((vector_size(x)))?
>
> I want to use this attribute in my source code to simplify coding
> (not, normally, to utilize SIMD instructions).  Someone may argue
> that I should use std::valarray, but it is stupidly slow.  Now with this
> restriction on the size value I may have to write something like
> std::valarray but without dynamic allocation.

See PR53024. One main reason is that supporting it would be some work, for 
not enough demand. Also, it can be done in user code, compiler support is 
not necessary (it would be convenient though). Even lowering an 
unsupported power of 2 to a set of smaller vectors still generates pretty 
bad code IIRC.

By the way, for 3 doubles on x86, would you prefer __m128d+double, or 
__m256d with one slot ignored?

-- 
Marc Glisse

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: size value of vector_size attribute
  2019-12-16  7:19 ` Marc Glisse
@ 2019-12-16 13:16   ` Chris Elrod
  2019-12-16 15:59     ` Xi Ruoyao
  0 siblings, 1 reply; 5+ messages in thread
From: Chris Elrod @ 2019-12-16 13:16 UTC (permalink / raw)
  To: gcc-help

I'm not the asker, but I would strongly prefer __m256d if code could be
generated masking the unused lane for safe loads/stores, at least on
architectures where this is efficient (e.g., Skylake-X).
This automatic masking would make writing SIMD code easier when you don't
have a power-of-2 number of elements, by saving the effort of passing the
bitmask to each operation (which is at least an option with immintrin.h;
I'm not sure about GCC's built-ins).

However, if the asker doesn't want this for SIMD code, but wants a
convenient vector to index for scalar code, I'd recommend defining your own
class. Indexing SIMD vectors is inefficient, and it may interfere with
optimizations like SROA. But I could be wrong; my experience is mostly with
Julia which uses LLVM. GCC may do better.

On Mon, Dec 16, 2019, 02:19 Marc Glisse <marc.glisse@inria.fr> wrote:

> On Mon, 16 Dec 2019, Xi Ruoyao wrote:
>
> > Is there any reason to enforce "x must be a power of 2" in
> > __attribute__((vector_size(x)))?
> >
> > I want to use this attribute in my source code to simplify coding
> > (not, normally, to utilize SIMD instructions).  Someone may argue
> > that I should use std::valarray, but it is stupidly slow.  Now with this
> > restriction on the size value I may have to write something like
> > std::valarray but without dynamic allocation.
>
> See PR53024. One main reason is that supporting it would be some work, for
> not enough demand. Also, it can be done in user code, compiler support is
> not necessary (it would be convenient though). Even lowering an
> unsupported power of 2 to a set of smaller vectors still generates pretty
> bad code IIRC.
>
> By the way, for 3 doubles on x86, would you prefer __m128d+double, or
> __m256d with one slot ignored?
>
> --
> Marc Glisse
>

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: size value of vector_size attribute
  2019-12-16 13:16   ` Chris Elrod
@ 2019-12-16 15:59     ` Xi Ruoyao
  2019-12-16 17:52       ` Chris Elrod
  0 siblings, 1 reply; 5+ messages in thread
From: Xi Ruoyao @ 2019-12-16 15:59 UTC (permalink / raw)
  To: Chris Elrod; +Cc: gcc-help

On 2019-12-16 08:16 -0500, Chris Elrod wrote:
> I'm not the asker, but I would strongly prefer __m256d if code could be
> generated masking the unused lane for safe loads/stores, at least on
> architectures where this is efficient (e.g., Skylake-X).
> This automatic masking would make writing SIMD code easier when you
> don't have a power-of-2 number of elements, by saving the effort of
> passing the bitmask to each operation (which is at least an option with
> immintrin.h; I'm not sure about GCC's built-ins).

I prefer __m256d too.  I'm already using vector_size(4*sizeof(double)) for
some calculations in 3D Euclidean space (only 3 elements are really used).

> However, if the asker doesn't want this for SIMD code, but wants a
> convenient vector to index for scalar code, I'd recommend defining
> your own
> class. Indexing SIMD vectors is inefficient, and it may interfere
> with
> optimizations like SROA. But I could be wrong; my experience is
> mostly with
> Julia which uses LLVM. GCC may do better.

I want SIMD code and I don't need much indexing.  But just curious: why
is indexing SIMD vectors inefficient?
-- 
Xi Ruoyao <xry111@mengyan1223.wang>
School of Aerospace Science and Technology, Xidian University

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: size value of vector_size attribute
  2019-12-16 15:59     ` Xi Ruoyao
@ 2019-12-16 17:52       ` Chris Elrod
  0 siblings, 0 replies; 5+ messages in thread
From: Chris Elrod @ 2019-12-16 17:52 UTC (permalink / raw)
  To: gcc-help

> I prefer __m256d too.  I'm already using vector_size(4*sizeof(double)) for
> some calculations in 3D Euclidean space (only 3 elements are really used).

Then I'd expect __m256d to give you good performance for a lot of it; i.e., 1
operation vs 2 (1x __m128d and 1x double) or 3 (3x double).
I wish this were a trick compilers did more often. ISPC might, but I haven't
played with it as much as I'd like.

> I want SIMD code and I don't need much indexing.  But just curious: why
> is indexing SIMD vectors inefficient?

If you don't do much indexing, it should be fine. Basically, to extract an
element, compilers seem to store the vector, and then load the desired
element.

Here are a couple simple examples:
https://godbolt.org/z/4EH_Nk

I used -O3 with gcc and clang.
index1 just indexes a pointer like normal. gcc and clang both use a single
vmovsd.
index2 loads a vector, and then indexes the vector. clang is able to reduce
that into being equivalent to index1, but gcc loads the vector, re-stores
the vector elsewhere, and then vmovsd from that stored vector.
index3 takes a vector as an argument, and then indexes it. Neither compiler
generated nice looking code here. They both stored the vector, and then
vmovsd to load the single element.

The index2 example with gcc was also particularly problematic, in that the
vector wasn't totally transparent to the optimizer.
Still, not a big deal if you don't do much indexing. If the autovectorizer
vectorized your scalar code, it would still have to resort to the same
tricks (storing a vector and reloading a scalar) to extract a scalar. So
you aren't losing anything.
Meaning it's probably totally fine, or much better than fine if the SIMD is
profitable.

On Mon, Dec 16, 2019 at 10:59 AM Xi Ruoyao <xry111@mengyan1223.wang> wrote:

> On 2019-12-16 08:16 -0500, Chris Elrod wrote:
> > I'm not the asker, but I would strongly prefer __m256d if code could be
> > generated masking the unused lane for safe loads/stores, at least on
> > architectures where this is efficient (e.g., Skylake-X).
> > This automatic masking would make writing SIMD code easier when you
> > don't have a power-of-2 number of elements, by saving the effort of
> > passing the bitmask to each operation (which is at least an option with
> > immintrin.h; I'm not sure about GCC's built-ins).
>
> I prefer __m256d too.  I'm already using vector_size(4*sizeof(double)) for
> some calculations in 3D Euclidean space (only 3 elements are really used).
>
> > However, if the asker doesn't want this for SIMD code, but wants a
> > convenient vector to index for scalar code, I'd recommend defining
> > your own
> > class. Indexing SIMD vectors is inefficient, and it may interfere
> > with
> > optimizations like SROA. But I could be wrong; my experience is
> > mostly with
> > Julia which uses LLVM. GCC may do better.
>
> I want SIMD code and I don't need much indexing.  But just curious: why
> is indexing SIMD vectors inefficient?
> --
> Xi Ruoyao <xry111@mengyan1223.wang>
> School of Aerospace Science and Technology, Xidian University
>
>

-- 
https://github.com/chriselrod?tab=repositories
https://www.linkedin.com/in/chris-elrod-9720391a/

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2019-12-16 17:52 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-12-16  6:14 size value of vector_size attribute Xi Ruoyao
2019-12-16  7:19 ` Marc Glisse
2019-12-16 13:16   ` Chris Elrod
2019-12-16 15:59     ` Xi Ruoyao
2019-12-16 17:52       ` Chris Elrod

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).