From: Jakub Jelinek <jakub@redhat.com>
To: Sandra Loosemore <sandra@codesourcery.com>, Jan Hubicka <jh@suse.cz>
Cc: Thomas Schwinge <thomas@codesourcery.com>,
	"gcc-patches@gcc.gnu.org" <gcc-patches@gcc.gnu.org>
Subject: Re: [PATCH v2] Re: OpenMP: Generate SIMD clones for functions with "declare target"
Date: Fri, 30 Sep 2022 12:37:47 +0200	[thread overview]
Message-ID: <YzbG+z8B4rnIVd6U@tucnak> (raw)
In-Reply-To: <001679b1-814a-c1db-5611-c663f6931d11@codesourcery.com>

On Wed, Sep 21, 2022 at 09:17:18PM -0600, Sandra Loosemore wrote:
> On 9/14/22 12:12, Jakub Jelinek wrote:
> 
> > If it is pure optimization thing and purely keyed on the definition,
> > all the simd clones should be local to the TU, never exported from it.
> 
> OK, here is a revised patch that addresses that.  x86_64 target also
> generates a different set of clones for functions with internal linkage vs
> external so I hacked that to treat these implicit clones in the same way as
> other internal clones.
> 
> There is an existing problem with internal "declare simd" clones in that
> nothing ever DCEs clones that end up not being useful, or does a scan of the
> code in the compilation unit before clone generation to avoid generating
> useless clones in the first place.  I haven't tried to solve that problem,
> but I did attempt to mitigate it for these implicit "declare target" clones
> by tagging the option OPT_LEVELS_2_PLUS_SPEED_ONLY (instead of enabling it
> by default all the time) so the clones are not generated by default at -Os
> and -Og.  I added a couple new test cases to check this.

We've discussed this at Cauldron.  Solving the above is essential,
especially for this patch, and less urgently for explicit declare simd on
non-exported functions (less urgent only because people don't usually mark
everything declare simd).  I don't say it can't be done incrementally, but
if the patch is added to trunk, it needs to be solved before GCC 13
branches.
We need to arrange for cgraph to process the declare simd clones after the
callers of the corresponding main function, so that by the time we try to
post-IPA optimize the clones we can see whether they were actually used
and, if not, throw them away.

On the other side, for the implicit declare simd (in the explicit case it
is the user's choice), it might be useful to actually check whether the
function clone is vectorizable before deciding to make use of it.
I doubt it will be a good optimization if we clone it, push arguments into
vectors, then, because vectorization failed, take it apart, do a serial
loop, create the return vector from the scalar results and return.
Though, thinking more about it, for the amdgcn case it may be worth it even
then if we manage to vectorize the caller.  Because if failed
vectorization on amdgcn means we perform significantly slower, it can be
helpful to have even partial vectorization: vectorize the statements that
can be vectorized and use a scalar loop for the others.  Our vectorizer is
not prepared to do that right now, I believe (which is why e.g. for
#pragma omp ordered simd we just make the whole loop non-vectorizable,
rather than using a scalar loop for the stuff in there and vectorizing the
rest), but with this optimization we'd effectively achieve that at least at
function call boundaries (though only in one direction, when the caller can
be vectorized and the callee can't; no optimization if the caller can't and
the callee could).

> +/* Helper function for mark_auto_simd_clone; return false if the statement
> +   violates restrictions for an "omp declare simd" function.  Specifically,
> +   the function must not
> +   - throw or call setjmp/longjmp
> +   - write memory that could alias parallel calls
> +   - include openmp directives or calls
> +   - call functions that might do those things */
> +
> +static bool
> +auto_simd_check_stmt (gimple *stmt, tree outer)
> +{
> +  tree decl;
> +
> +  switch (gimple_code (stmt))
> +    {
> +    case GIMPLE_CALL:
> +      decl = gimple_call_fndecl (stmt);
> +
> +      /* We can't know whether indirect calls are safe.  */
> +      if (decl == NULL_TREE)
> +	return false;

What about internal function calls?  Are all of them undesirable, or
some of them?  We do have const / pure ifns, ...
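Something along these lines might work, placed before the NULL fndecl
check (an untested sketch; it relies on gimple_call_flags returning the
internal fn's flags for internal calls):

      /* Internal function calls have no fndecl; accept only const/pure
	 ones, punt on the rest.  */
      if (gimple_call_internal_p (stmt))
	{
	  if (gimple_call_flags (stmt) & (ECF_CONST | ECF_PURE))
	    break;
	  return false;
	}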
> +
> +      /* Calls to functions that are CONST or PURE are ok.  */
> +      if (gimple_call_flags (stmt) & (ECF_CONST | ECF_PURE))
> +	break;
> +
> +      /* Calls to functions that are already marked "omp declare simd" are
> +	 OK.  */
> +      if (lookup_attribute ("omp declare simd", DECL_ATTRIBUTES (decl)))
> +	break;

You could instead look up the cgraph simd clone info for the function...
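E.g. something like (untested, and assuming the callee's simd clone info
is already populated at this point):

      if (cgraph_node *callee = cgraph_node::get (decl))
	if (callee->simd_clones)
	  break;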

> +      /* OpenMP directives are not permitted.  */
> +    CASE_GIMPLE_OMP:
> +      return false;

This makes no sense.  The function is called on low GIMPLE during IPA;
there are no GOMP_* statements at this point in the IL, everything has
already been expanded.  Most OpenMP directives, though, end up calling
libgomp APIs, which aren't pure/const and don't have the declare simd
attribute...
Exceptions can be e.g. the master construct, or a statically scheduled
nowait worksharing loop.
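For example (illustrative only), after omp-expand something like

  #pragma omp critical
  x++;

has already become roughly

  GOMP_critical_start ();
  x++;
  GOMP_critical_end ();

so the GIMPLE_CALL handling above rejects it anyway, because
GOMP_critical_start isn't const/pure and has no declare simd attribute.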

> +      /* Conservatively reject all EH-related constructs.  */
> +    case GIMPLE_CATCH:
> +    case GIMPLE_EH_FILTER:
> +    case GIMPLE_EH_MUST_NOT_THROW:
> +    case GIMPLE_EH_ELSE:
> +    case GIMPLE_EH_DISPATCH:
> +    case GIMPLE_RESX:
> +    case GIMPLE_TRY:

Most of these won't appear in low GIMPLE either; I think GIMPLE_RESX does
and GIMPLE_EH_DISPATCH too, but the rest probably can't.

> +      return false;
> +
> +      /* Asms are not permitted since we don't know what they do.  */
> +    case GIMPLE_ASM:
> +      return false;

What about volatile stmts?  Even volatile loads should be punted on.
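I mean something like this near the end of auto_simd_check_stmt
(untested):

  /* Punt on anything with volatile operands, even volatile loads.  */
  if (gimple_has_volatile_ops (stmt))
    return false;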

> +
> +    default:
> +      break;
> +    }
> +
> +  /* Memory writes are not permitted.
> +     FIXME: this could be relaxed a little to permit writes to
> +     function-local variables that could not alias other instances
> +     of the function running in parallel.  */
> +  if (gimple_store_p (stmt))
> +    return false;
> +  else
> +    return true;
> +}

> +  FOR_EACH_BB_FN (bb, DECL_STRUCT_FUNCTION (decl))
> +    {
> +      for (gimple_stmt_iterator gsi = gsi_start_bb (bb); !gsi_end_p (gsi);
> +	   gsi_next (&gsi))
> +	if (!auto_simd_check_stmt (gsi_stmt (gsi), decl))
> +	  return NULL_TREE;
> +    }

If you want to punt on exceptions, I guess you could punt on EDGE_EH
or EDGE_ABNORMAL edges out of basic blocks.
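I.e. in the FOR_EACH_BB_FN loop something like (untested):

      edge e;
      edge_iterator ei;
      FOR_EACH_EDGE (e, ei, bb->succs)
	if (e->flags & (EDGE_EH | EDGE_ABNORMAL))
	  return NULL_TREE;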

> +  attr = lookup_attribute ("omp declare simd",
> +			   DECL_ATTRIBUTES (node->decl));
> +
> +  /* See if we can add an "omp declare simd" directive implicitly
> +     before giving up.  */
> +  /* FIXME: OpenACC "#pragma acc routine" translates into
> +     "omp declare target", but appears also to have some other effects
> +     that conflict with generating SIMD clones, causing ICEs.  So don't
> +     do this if we've got OpenACC instead of OpenMP.  */
> +  if (attr == NULL_TREE
> +      && flag_openmp_target_simd_clone
> +      && !oacc_get_fn_attrib (node->decl))

I admit I don't remember where exactly the simd clone creation happens
wrt. other IPA passes, but I think it is a late pass; so, for GCN
offloading does it happen only in the lto1 offloading compiler?
Shouldn't the automatic optimization then be done only in the offloading
lto1 for GCN (say, guarded on a targetm boolean)?

Otherwise, if we do it for the host offloading fallback as well (I think
it is still undesirable for PTX offloading because it is a waste of time;
there is no vectorization there, it is SIMT instead), it might be a good
idea to check in the cgraph that the function has at least one caller.
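I.e. something like extending the guard to (untested sketch; it would not
catch functions only called indirectly, so take it with a grain of salt):

  if (attr == NULL_TREE
      && flag_openmp_target_simd_clone
      && node->callers != NULL
      && !oacc_get_fn_attrib (node->decl))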

> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/gomp/target-simd-clone-1.c
> @@ -0,0 +1,18 @@
> +/* { dg-options "-fopenmp -O2" } */
> +
> +/* Test that simd clones are generated for functions with "declare target".  */
> +
> +#pragma omp declare target
> +int addit(int a, int b, int c)
> +{
> +  return a + b;
> +}
> +#pragma omp end declare target

Because in cases like this, where nothing calls it in the same TU and we
aren't LTO optimizing, creating the internal clones is a pure waste of
energy.  Nothing will vectorize using them.

	Jakub



Thread overview: 10+ messages
2022-09-14 17:32 Sandra Loosemore
2022-09-14 18:12 ` Jakub Jelinek
2022-09-22  3:17   ` [PATCH v2] " Sandra Loosemore
2022-09-30 10:37     ` Jakub Jelinek [this message]
2022-10-17  1:23       ` [PATCH v3] " Sandra Loosemore
2022-10-20 14:07         ` Jakub Jelinek
2022-10-27  2:27           ` Sandra Loosemore
2022-10-27 10:09             ` Thomas Schwinge
2022-10-27 20:40               ` Sandra Loosemore
2022-09-14 21:45 ` Thomas Schwinge
