Complex multiply optimization working?

public inbox for gcc@gcc.gnu.org
 help / color / mirror / Atom feed

* Complex multiply optimization working?
@ 2022-04-11 11:19 Andrew Stubbs
  2022-04-11 12:02 ` Richard Biener
  2022-04-11 12:03 ` Tamar Christina
  0 siblings, 2 replies; 5+ messages in thread
From: Andrew Stubbs @ 2022-04-11 11:19 UTC (permalink / raw)
  To: GCC Development

[-- Attachment #1: Type: text/plain, Size: 1174 bytes --]

Hi all,

I've been looking at implementing the complex multiply patterns for the 
amdgcn port, but I'm not getting the code I was hoping for. When I try 
to use the patterns on x86_64 or AArch64 they don't seem to work there 
either, so is there something wrong with the middle-end? I've tried both 
current HEAD and GCC 11.

The example shown in the internals manual is a simple loop multiplying 
two arrays of complex numbers, and writing the results to a third. I had 
expected that it would use the largest vectorization factor available, 
with the real/imaginary numbers in even/odd lanes as described, but the 
vectorization factor is only 2 (so, a single complex number), and I have 
to set -fvect-cost-model=unlimited to get even that.

I tried another example with SLP and that too uses the cmul patterns 
only for a single real/imaginary pair.

Did proper vectorization of cmul ever really work? There is a case in 
the testsuite for the pattern match, but it isn't in a loop.

Thanks

Andrew

P.S. I attached my testcase, in case I'm doing something stupid.

P.P.S. The manual says the pattern is "cmulm4", etc., but it's actually 
"cmulm3" in the implementation.

[-- Attachment #2: t.c --]
[-- Type: text/plain, Size: 1128 bytes --]

typedef _Complex double complexT;
#define arraysize 256

void f(
complexT a[restrict arraysize],
complexT b[restrict arraysize],
complexT c[restrict arraysize]
       )
{
#if defined(LOOP)
  for (int i = 0; i < arraysize; i++)
    c[i] = a[i] * b[i];
#else

    c[0] = a[0] * b[0];
    c[1] = a[1] * b[1];
    c[2] = a[2] * b[2];
    c[3] = a[3] * b[3];
    c[4] = a[4] * b[4];
    c[5] = a[5] * b[5];
    c[6] = a[6] * b[6];
    c[7] = a[7] * b[7];
    c[8] = a[8] * b[8];
    c[9] = a[9] * b[9];
    c[10] = a[10] * b[10];
    c[11] = a[11] * b[11];
    c[12] = a[12] * b[12];
    c[13] = a[13] * b[13];
    c[14] = a[14] * b[14];
    c[15] = a[15] * b[15];
    c[16] = a[16] * b[16];
    c[17] = a[17] * b[17];
    c[18] = a[18] * b[18];
    c[19] = a[19] * b[19];
    c[20] = a[20] * b[20];
    c[21] = a[21] * b[21];
    c[22] = a[22] * b[22];
    c[23] = a[23] * b[23];
    c[24] = a[24] * b[24];
    c[25] = a[25] * b[25];
    c[26] = a[26] * b[26];
    c[27] = a[27] * b[27];
    c[28] = a[28] * b[28];
    c[29] = a[29] * b[29];
    c[30] = a[30] * b[30];
    c[31] = a[31] * b[31];
    c[32] = a[32] * b[32];
#endif
}

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Complex multiply optimization working?
  2022-04-11 11:19 Complex multiply optimization working? Andrew Stubbs
@ 2022-04-11 12:02 ` Richard Biener
  2022-04-11 12:47   ` Andrew Stubbs
  2022-04-11 12:03 ` Tamar Christina
  1 sibling, 1 reply; 5+ messages in thread
From: Richard Biener @ 2022-04-11 12:02 UTC (permalink / raw)
  To: Andrew Stubbs; +Cc: GCC Development, Tamar Christina

On Mon, Apr 11, 2022 at 1:26 PM Andrew Stubbs <ams@codesourcery.com> wrote:
>
> Hi all,
>
> I've been looking at implementing the complex multiply patterns for the
> amdgcn port, but I'm not getting the code I was hoping for. When I try
> to use the patterns on x86_64 or AArch64 they don't seem to work there
> either, so is there something wrong with the middle-end? I've tried both
> current HEAD and GCC 11.
>
> The example shown in the internals manual is a simple loop multiplying
> two arrays of complex numbers, and writing the results to a third. I had
> expected that it would use the largest vectorization factor available,
> with the real/imaginary numbers in even/odd lanes as described, but the
> vectorization factor is only 2 (so, a single complex number), and I have
> to set -fvect-cost-model=unlimited to get even that.
>
> I tried another example with SLP and that too uses the cmul patterns
> only for a single real/imaginary pair.
>
> Did proper vectorization of cmul ever really work? There is a case in
> the testsuite for the pattern match, but it isn't in a loop.

You need to check the vectorizer dump whether a complex pattern
was recognized or not.  Did you properly use -ffast-math?

>
> Thanks
>
> Andrew
>
> P.S. I attached my testcase, in case I'm doing something stupid.
>
> P.P.S. The manual says the pattern is "cmulm4", etc., but it's actually
> "cmulm3" in the implementation.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* RE: Complex multiply optimization working?
  2022-04-11 11:19 Complex multiply optimization working? Andrew Stubbs
  2022-04-11 12:02 ` Richard Biener
@ 2022-04-11 12:03 ` Tamar Christina
  2022-04-11 12:51   ` Andrew Stubbs
  1 sibling, 1 reply; 5+ messages in thread
From: Tamar Christina @ 2022-04-11 12:03 UTC (permalink / raw)
  To: Andrew Stubbs, GCC Development

HI,

> -----Original Message-----
> From: Andrew Stubbs <ams@codesourcery.com>
> Sent: Monday, April 11, 2022 12:19 PM
> To: GCC Development <gcc@gcc.gnu.org>
> Cc: Tamar Christina <Tamar.Christina@arm.com>
> Subject: Complex multiply optimization working?
> 
> Hi all,
> 
> I've been looking at implementing the complex multiply patterns for the
> amdgcn port, but I'm not getting the code I was hoping for. When I try to use
> the patterns on x86_64 or AArch64 they don't seem to work there either, so
> is there something wrong with the middle-end? I've tried both current HEAD
> and GCC 11.

They work fine in both GCC 11 and HEAD https://godbolt.org/z/Mxxz6qWbP 
Did you actually enable the instructions?

The fully unrolled form doesn't get detected at -Ofast because the SLP vectorizer doesn't
detect TWO_OPERAND nodes as a constructor, see https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104406

note:   Final SLP tree for instance 0x2debde0:
note:   node 0x2cdf900 (max_nunits=2, refcnt=2) vector(2) double
note:   op template: _463 = _457 * _460;
note:   	stmt 0 _463 = _457 * _460;
note:   	stmt 1 _464 = _458 * _459;
note:   	children 0x2cdf990 0x2cdfa20
note:   node 0x2cdf990 (max_nunits=2, refcnt=2) vector(2) double
note:   op template: _457 = REALPART_EXPR <MEM[(complexT *)a_101(D) + 512B]>;
note:   	stmt 0 _457 = REALPART_EXPR <MEM[(complexT *)a_101(D) + 512B]>;
note:   	stmt 1 _458 = IMAGPART_EXPR <MEM[(complexT *)a_101(D) + 512B]>;
note:   	load permutation { 64 65 }
note:   node 0x2cdfa20 (max_nunits=2, refcnt=2) vector(2) double
note:   op template: _460 = IMAGPART_EXPR <MEM[(complexT *)b_102(D) + 512B]>;
note:   	stmt 0 _460 = IMAGPART_EXPR <MEM[(complexT *)b_102(D) + 512B]>;
note:   	stmt 1 _459 = REALPART_EXPR <MEM[(complexT *)b_102(D) + 512B]>;
note:   	load permutation { 65 64 }

in the general case, were these to be scalars the benefits are dubious because of the moving between
register files.

At -O3 it works fine (no -Ofast canonization rules rewriting the form) but the cost of the loop is too high to be profitable.
You have to disable The cost model to get it to vectorize where it would use them https://godbolt.org/z/MsGq84WP9
And the vectorizer is right here, the scalar code is cheaper.

The various canonicalization differences at -Ofast makes many different forms
I.e. https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104408

But yes, detection is working as intended, but some -Ofast cases are not detected yet.

> 
> The example shown in the internals manual is a simple loop multiplying two
> arrays of complex numbers, and writing the results to a third. I had expected
> that it would use the largest vectorization factor available, with the
> real/imaginary numbers in even/odd lanes as described, but the
> vectorization factor is only 2 (so, a single complex number), and I have to set
> -fvect-cost-model=unlimited to get even that.
> 
> I tried another example with SLP and that too uses the cmul patterns only for
> a single real/imaginary pair.
> 
> Did proper vectorization of cmul ever really work? There is a case in the
> testsuite for the pattern match, but it isn't in a loop.
> 

There are both SLP and LOOP variants in the testsuite. All the patterns are inside of a loop
The mul tests are generated from https://github.com/gcc-mirror/gcc/blob/master/gcc/testsuite/gcc.dg/vect/complex/complex-mul-template.c

Where the tests that use of this template instructs the vectorizer to unroll some cases
and others they're kept as a loop. So both are tested in the testsuite.

> Thanks
> 
> Andrew
> 
> P.S. I attached my testcase, in case I'm doing something stupid.

Both work https://godbolt.org/z/Mxxz6qWbP and https://godbolt.org/z/MsGq84WP9,

Regards,
Tamar

> 
> P.P.S. The manual says the pattern is "cmulm4", etc., but it's actually
> "cmulm3" in the implementation.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Complex multiply optimization working?
  2022-04-11 12:02 ` Richard Biener
@ 2022-04-11 12:47   ` Andrew Stubbs
  0 siblings, 0 replies; 5+ messages in thread
From: Andrew Stubbs @ 2022-04-11 12:47 UTC (permalink / raw)
  To: Richard Biener; +Cc: GCC Development, Tamar Christina

On 11/04/2022 13:02, Richard Biener wrote:
> You need to check the vectorizer dump whether a complex pattern
> was recognized or not.  Did you properly use -ffast-math?

Aha! I needed to enable -ffast-math.

I missed that this is unsafe, and there's a fall-back to _muldc3 on NaN.

OK, presumably I need to implement a vector version of the fall-back 
libcall if I want this to work without ffast-math.

Thanks

Andrew

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Complex multiply optimization working?
  2022-04-11 12:03 ` Tamar Christina
@ 2022-04-11 12:51   ` Andrew Stubbs
  0 siblings, 0 replies; 5+ messages in thread
From: Andrew Stubbs @ 2022-04-11 12:51 UTC (permalink / raw)
  To: Tamar Christina, GCC Development

On 11/04/2022 13:03, Tamar Christina wrote:
> They work fine in both GCC 11 and HEAD https://godbolt.org/z/Mxxz6qWbP
> Did you actually enable the instructions?

Yes, as I said it uses the instructions, just not fully vectorized. 
Anyway, the problem was I needed -ffast-math to skip the NaN checks.

> There are both SLP and LOOP variants in the testsuite. All the patterns are inside of a loop
> The mul tests are generated from https://github.com/gcc-mirror/gcc/blob/master/gcc/testsuite/gcc.dg/vect/complex/complex-mul-template.c
> 
> Where the tests that use of this template instructs the vectorizer to unroll some cases
> and others they're kept as a loop. So both are tested in the testsuite.

Thanks. This is helpful. My grep skills clearly need work.

Andrew

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2022-04-11 12:52 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-04-11 11:19 Complex multiply optimization working? Andrew Stubbs
2022-04-11 12:02 ` Richard Biener
2022-04-11 12:47   ` Andrew Stubbs
2022-04-11 12:03 ` Tamar Christina
2022-04-11 12:51   ` Andrew Stubbs

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).