On 9/30/22 04:37, Jakub Jelinek wrote:
> We've discussed this at Cauldron.  Especially for this patch, but less
> urgently for explicit declare simd on non-exported functions (less urgently
> just because people don't usually mark everything declare simd), solving the
> above is essential.  I don't say it can't be done incrementally, but if the
> patch is added to trunk, it needs to be solved before 13 branches.
> We need to arrange cgraph to process the declare simd clones after the
> callers of the corresponding main function, so that by the time we try to
> post-IPA optimize the clones we can see if they were actually used or not
> and, if not, throw them away.
>
> On the other side, for the implicit declare simd (in the explicit case it is
> the user's choice), it might be useful to actually see if the function clone
> is vectorizable before deciding whether to make use of it.
> I doubt it will be a good optimization if we clone it, push
> arguments into vectors, then because vectorization failed take it apart,
> do a serial loop, create a return vector from the scalar results and return.
> Though, thinking more about it, for the amdgcn case it may be worth it even
> then, if we manage to vectorize the caller.  Because if failed
> vectorization on amdgcn means we perform significantly slower, it can be
> helpful to have even partial vectorization: vectorize statements that can
> be vectorized and use a scalar loop for the others.  Our vectorizer is not
> prepared to do that right now, I believe (which is why e.g. for
> #pragma omp ordered simd we just make the whole loop non-vectorizable,
> rather than using a scalar loop for the stuff in there and vectorizing the
> rest), but with this optimization we'd effectively achieve that at least at
> function call boundaries (though only in one direction, if the caller can
> be vectorized and the callee can't; no optimization if the caller can't and
> the callee could be).
My sense is that the first approach would be more straightforward than the second one, and I am willing to continue working on it.  However, I need some direction to get started: I presently know nothing about cgraph, and I was unable to find any useful overview or interface documentation in the GCC internals manual.  Is this as simple as inserting an existing pass into the pass list to clean up after vectorization, or does it involve writing something more or less from scratch?

>
>> +      /* OpenMP directives are not permitted.  */
>> +      CASE_GIMPLE_OMP:
>> +	return false;
>
> This makes no sense.  The function is called on low GIMPLE during IPA;
> there are no GOMP_* statements at this point in the IL, everything has
> been expanded.  Most OpenMP directives, though, end up calling
> libgomp APIs which aren't pure/const and don't have the declare simd
> attribute...
> Exceptions can be, say, the master construct, or a static-scheduling
> nowait worksharing loop.
>
>> +      /* Conservatively reject all EH-related constructs.  */
>> +      case GIMPLE_CATCH:
>> +      case GIMPLE_EH_FILTER:
>> +      case GIMPLE_EH_MUST_NOT_THROW:
>> +      case GIMPLE_EH_ELSE:
>> +      case GIMPLE_EH_DISPATCH:
>> +      case GIMPLE_RESX:
>> +      case GIMPLE_TRY:
>
> Most of these won't appear in low gimple either; I think GIMPLE_RESX
> does, and GIMPLE_EH_DISPATCH too, but the rest probably can't.

OK, this was my bad.  I cut and pasted this from some code that was originally written for the OMP lowering pass.  I've moved the entire plausibility filter to a new pass that runs just before OMP lowering; it seems easier to detect the things that are invalid in a cloneable function while they are still in a form closer to the source constructs.

>> +	return false;
>> +
>> +      /* Asms are not permitted since we don't know what they do.  */
>> +      case GIMPLE_ASM:
>> +	return false;
>
> What about volatile stmts?  Even volatile loads should be punted on.

That's fixed now too.
>
>> +  attr = lookup_attribute ("omp declare simd",
>> +			   DECL_ATTRIBUTES (node->decl));
>> +
>> +  /* See if we can add an "omp declare simd" directive implicitly
>> +     before giving up.  */
>> +  /* FIXME: OpenACC "#pragma acc routine" translates into
>> +     "omp declare target", but appears also to have some other effects
>> +     that conflict with generating SIMD clones, causing ICEs.  So don't
>> +     do this if we've got OpenACC instead of OpenMP.  */
>> +  if (attr == NULL_TREE
>> +      && flag_openmp_target_simd_clone
>> +      && !oacc_get_fn_attrib (node->decl))
>
> I admit I don't remember where exactly the simd cloning happens wrt. other
> IPA passes, but I think it is a late pass; so, does it happen for GCN
> offloading only in the lto1 offloading compiler?
> Shouldn't the automatic optimization then be done only in the offloading
> lto1 for GCN (say, guarded on a targetm boolean)?

I'm afraid I don't know much about offloading, but I was under the impression it all goes through the same compilation process, just with a different target?

> Otherwise, if we do it, say, for the host offloading fallback as well
> (I think it is still undesirable for PTX offloading because it is a waste of
> time: there is no vectorization there, it is SIMT instead), it might be
> a good idea to check cgraph that the function has at least one caller.

As I said previously, I don't understand cgraph, but in my new patch I arranged things so that implicit clones are only created if a call to the function is found in an OMP loop (not merely any caller anywhere).  So this should be fixed now.

New patch attached.  Is this one OK for mainline?

-Sandra