public inbox for gcc-patches@gcc.gnu.org
From: Richard Biener <rguenther@suse.de>
To: Thomas Schwinge <tschwinge@baylibre.com>
Cc: Andrew Stubbs <ams@baylibre.com>, gcc-patches@gcc.gnu.org
Subject: Re: GCN RDNA2+ vs. GCC vectorizer "Reduce using vector shifts" (was: [committed] amdgcn: add -march=gfx1030 EXPERIMENTAL)
Date: Tue, 13 Feb 2024 09:26:57 +0100 (CET)	[thread overview]
Message-ID: <4883749p-27n0-1888-s940-s7q990po84os@fhfr.qr> (raw)
In-Reply-To: <871q9hljxv.fsf@euler.schwinge.ddns.net>

On Mon, 12 Feb 2024, Thomas Schwinge wrote:

> Hi!
> 
> On 2023-10-20T12:51:03+0100, Andrew Stubbs <ams@codesourcery.com> wrote:
> > I've committed this patch
> 
> ... as commit c7ec7bd1c6590cf4eed267feab490288e0b8d691
> "amdgcn: add -march=gfx1030 EXPERIMENTAL".
> 
> The RDNA2 ISA variant doesn't support certain instructions previously
> implemented in GCC/GCN, so a number of patterns etc. had to be disabled:
> 
> > [...] Vector
> > reductions will need to be reworked for RDNA2.  [...]
> 
> > 	* config/gcn/gcn-valu.md (@dpp_move<mode>): Disable for RDNA2.
> > 	(addc<mode>3<exec_vcc>): Add RDNA2 syntax variant.
> > 	(subc<mode>3<exec_vcc>): Likewise.
> > 	(<convop><mode><vndi>2_exec): Add RDNA2 alternatives.
> > 	(vec_cmp<mode>di): Likewise.
> > 	(vec_cmp<u><mode>di): Likewise.
> > 	(vec_cmp<mode>di_exec): Likewise.
> > 	(vec_cmp<u><mode>di_exec): Likewise.
> > 	(vec_cmp<mode>di_dup): Likewise.
> > 	(vec_cmp<mode>di_dup_exec): Likewise.
> > 	(reduc_<reduc_op>_scal_<mode>): Disable for RDNA2.
> > 	(*<reduc_op>_dpp_shr_<mode>): Likewise.
> > 	(*plus_carry_dpp_shr_<mode>): Likewise.
> > 	(*plus_carry_in_dpp_shr_<mode>): Likewise.
> 
> Etc.  The expectation being that the GCC middle end copes with this and
> synthesizes some less ideal yet still functional vector code, I presume.
> 
> The later RDNA3/gfx1100 support builds on top of this, and that's what
> I'm currently working on getting proper GCC/GCN target (not offloading)
> results for.
> 
> I'm seeing a good number of execution test FAILs (regressions compared to
> my earlier non-gfx1100 testing), and I've now tracked down where one
> large class of those comes into existence -- not yet how to resolve,
> unfortunately.  But maybe, with you guys' combined vectorizer and back
> end experience, the latter will be done quickly?
> 
> Richard, I don't know if you've ever run actual GCC/GCN target (not
> offloading) testing; let me know if you have any questions about that.

I've only done offload testing - in the x86_64 build tree, run
check-target-libgomp.  If you can tell me how to do GCN target testing
(maybe even document it on the wiki!), I can try to do that as well.

> Given that (at least largely?) the same patterns etc. are disabled as in
> my gfx1100 configuration, I suppose your gfx1030 one would exhibit the
> same issues.  You can build GCC/GCN target like you build the offloading
> one, just remove '--enable-as-accelerator-for=[...]'.  Likely, you can
> even use an offloading GCC/GCN build to reproduce the issue below.
> 
> One example is the attached 'builtin-bitops-1.c', reduced from
> 'gcc.c-torture/execute/builtin-bitops-1.c', where 'my_popcount' is
> miscompiled as soon as '-ftree-vectorize' is effective:
> 
>     $ build-gcc/gcc/xgcc -Bbuild-gcc/gcc/ builtin-bitops-1.c -Bbuild-gcc/amdgcn-amdhsa/gfx1100/newlib/ -Lbuild-gcc/amdgcn-amdhsa/gfx1100/newlib -fdump-tree-all-all -fdump-ipa-all-all -fdump-rtl-all-all -save-temps -march=gfx1100 -O1 -ftree-vectorize
> 
> In the 'diff' of 'a-builtin-bitops-1.c.179t.vect', for example, for
> '-march=gfx90a' vs. '-march=gfx1100', we see:
> 
>     +builtin-bitops-1.c:7:17: missed:   reduc op not supported by target.
> 
> ..., and therefore:
> 
>     -builtin-bitops-1.c:7:17: note:  Reduce using direct vector reduction.
>     +builtin-bitops-1.c:7:17: note:  Reduce using vector shifts
>     +builtin-bitops-1.c:7:17: note:  extract scalar result
> 
> That is, instead of one '.REDUC_PLUS' for gfx90a, for gfx1100 we build a
> chain of summation of 'VEC_PERM_EXPR's.  However, there's wrong code
> generated:
> 
>     $ flock /tmp/gcn.lock build-gcc/gcc/gcn-run a.out
>     i=1, ints[i]=0x1 a=1, b=2
>     i=2, ints[i]=0x80000000 a=1, b=2
>     i=3, ints[i]=0x2 a=1, b=2
>     i=4, ints[i]=0x40000000 a=1, b=2
>     i=5, ints[i]=0x10000 a=1, b=2
>     i=6, ints[i]=0x8000 a=1, b=2
>     i=7, ints[i]=0xa5a5a5a5 a=16, b=32
>     i=8, ints[i]=0x5a5a5a5a a=16, b=32
>     i=9, ints[i]=0xcafe0000 a=11, b=22
>     i=10, ints[i]=0xcafe00 a=11, b=22
>     i=11, ints[i]=0xcafe a=11, b=22
>     i=12, ints[i]=0xffffffff a=32, b=64
> 
> (I can't tell if the 'b = 2 * a' pattern is purely coincidental?)
> 
> I don't speak enough "vectorization" to fully understand the generic
> vectorized algorithm and its implementation.  It appears that the
> "Reduce using vector shifts" code has been around for a very long time,
> but also has gone through a number of changes.  I can't tell which GCC
> targets/configurations it's actually used for (in the same way as for
> GCN gfx1100), and thus whether there's an issue in that vectorizer code,
> or rather in the GCN back end, or in the GCN back end's parameterization
> of the generic code?

The "shift" reduction is basically doing reduction by repeatedly
adding the upper to the lower half of the vector (each time halving
the vector size).

> Manually working through the 'a-builtin-bitops-1.c.265t.optimized' code:
> 
>     int my_popcount (unsigned int x)
>     {
>       int stmp__12.12;
>       vector(64) int vect__12.11;
>       vector(64) unsigned int vect__1.8;
>       vector(64) unsigned int _13;
>       vector(64) unsigned int vect_cst__18;
>       vector(64) int [all others];
>     
>       <bb 2> [local count: 32534376]:
>       vect_cst__18 = { [all 'x_8(D)'] };
>       vect__1.8_19 = vect_cst__18 >> { 0, 1, 2, [...], 61, 62, 63 };
>       _13 = .COND_AND ({ [32 x '-1'], [32 x '0'] }, vect__1.8_19, { [all '1'] }, { [all '0'] });
>       vect__12.11_24 = VIEW_CONVERT_EXPR<vector(64) int>(_13);
>       _26 = VEC_PERM_EXPR <vect__12.11_24, { [all '0'] }, { 32, 33, 34, [...], 93, 94, 95 }>;
>       _27 = vect__12.11_24 + _26;
>       _28 = VEC_PERM_EXPR <_27, { [all '0'] }, { 16, 17, 18, [...], 77, 78, 79 }>;
>       _29 = _27 + _28;
>       _30 = VEC_PERM_EXPR <_29, { [all '0'] }, { 8, 9, 10, [...], 69, 70, 71 }>;
>       _31 = _29 + _30;
>       _32 = VEC_PERM_EXPR <_31, { [all '0'] }, { 4, 5, 6, [...], 65, 66, 67 }>;
>       _33 = _31 + _32;
>       _34 = VEC_PERM_EXPR <_33, { [all '0'] }, { 2, 3, 4, [...], 63, 64, 65 }>;
>       _35 = _33 + _34;
>       _36 = VEC_PERM_EXPR <_35, { [all '0'] }, { 1, 2, 3, [...], 62, 63, 64 }>;
>       _37 = _35 + _36;
>       stmp__12.12_38 = BIT_FIELD_REF <_37, 32, 0>;
>       return stmp__12.12_38;
> 
> ..., for example, for 'x = 7', we get:
> 
>       vect_cst__18 = { [all '7'] };
>       vect__1.8_19 = { 7, 3, 1, 0, 0, 0, [...] };
>       _13 = { 1, 1, 1, 0, 0, 0, [...] };
>       vect__12.11_24 = { 1, 1, 1, 0, 0, 0, [...] };
>       _26 = { [all '0'] };
>       _27 = { 1, 1, 1, 0, 0, 0, [...] };
>       _28 = { [all '0'] };
>       _29 = { 1, 1, 1, 0, 0, 0, [...] };
>       _30 = { [all '0'] };
>       _31 = { 1, 1, 1, 0, 0, 0, [...] };
>       _32 = { [all '0'] };
>       _33 = { 1, 1, 1, 0, 0, 0, [...] };
>       _34 = { 1, 0, 0, 0, [...] };
>       _35 = { 2, 1, 1, 0, 0, 0, [...] };
>       _36 = { 1, 1, 0, 0, 0, [...] };
>       _37 = { 3, 2, 1, 0, 0, 0, [...] };
>       stmp__12.12_38 = 3;
>       return 3;
> 
> ..., so the algorithm would appear to synthesize correct code for that
> case.  Adding '7' to 'builtin-bitops-1.c', however, we again get:
> 
>     i=13, ints[i]=0x7 a=3, b=6
> 
> 
> With the following hack applied to 'gcc/tree-vect-loop.cc':
> 
>     @@ -6687,8 +6687,9 @@ vect_create_epilog_for_reduction (loop_vec_info loop_vinfo,
>            reduce_with_shift = have_whole_vector_shift (mode1);
>            if (!VECTOR_MODE_P (mode1)
>               || !directly_supported_p (code, vectype1))
>             reduce_with_shift = false;
>     +      reduce_with_shift = false;
> 
> ..., I'm able to work around those regressions: by means of forcing
> "Reduce using scalar code" instead of "Reduce using vector shifts".

I would say it gets broken somewhere between the vectorizer and the GPU,
which likely means in the target?  Can you point out an issue in the
actual generated GCN code?

Iff this kind of reduction is the issue you'd see quite a lot of
vectorizer execute FAILs.  I'm seeing a .COND_AND above - could it
be that the "mask" is still set wrong when doing the reduction
steps?

Richard.

