From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=F4mR=BE=redhat.com=jakub@sourceware.org>
Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124])
	by sourceware.org (Postfix) with ESMTPS id 0BF453858C53
	for <fortran@gcc.gnu.org>; Mon, 15 May 2023 10:19:11 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 0BF453858C53
Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=redhat.com
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=redhat.com
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com;
	s=mimecast20190719; t=1684145950;
	h=from:from:reply-to:reply-to:subject:subject:date:date:
	 message-id:message-id:to:to:cc:cc:mime-version:mime-version:
	 content-type:content-type:in-reply-to:in-reply-to:  references:references;
	bh=9tt38e83XECUbLSAUkuogVs3zb3MNf3pzlI+yqlz5EI=;
	b=Zd5cr/GNWlQLfbFLercoYyCNTEDAUoMJvNhRshwlwjobkB8l7Fufd2g/gInqK/W0qq9qum
	6GS99zDBYJlK2CTpUB0mR20kctnMRYnKuP7c2UbnlLthhcCrhfkHWmII3GBenO4oybIR1m
	cI19QzK7MXniVpcx9RPVJRPxQZX7E44=
Received: from mimecast-mx02.redhat.com (mx3-rdu2.redhat.com
 [66.187.233.73]) by relay.mimecast.com with ESMTP with STARTTLS
 (version=TLSv1.2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id
 us-mta-374-PT2MmGNKO827tyuExPI6BA-1; Mon, 15 May 2023 06:19:05 -0400
X-MC-Unique: PT2MmGNKO827tyuExPI6BA-1
Received: from smtp.corp.redhat.com (int-mx05.intmail.prod.int.rdu2.redhat.com [10.11.54.5])
	(using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits))
	(No client certificate requested)
	by mimecast-mx02.redhat.com (Postfix) with ESMTPS id 1368D282CCA8;
	Mon, 15 May 2023 10:19:05 +0000 (UTC)
Received: from tucnak.zalov.cz (unknown [10.39.192.17])
	by smtp.corp.redhat.com (Postfix) with ESMTPS id AADB063F3D;
	Mon, 15 May 2023 10:19:04 +0000 (UTC)
Received: from tucnak.zalov.cz (localhost [127.0.0.1])
	by tucnak.zalov.cz (8.17.1/8.17.1) with ESMTPS id 34FAJ1a0175252
	(version=TLSv1.3 cipher=TLS_AES_256_GCM_SHA384 bits=256 verify=NOT);
	Mon, 15 May 2023 12:19:02 +0200
Received: (from jakub@localhost)
	by tucnak.zalov.cz (8.17.1/8.17.1/Submit) id 34FAJ0Pi175251;
	Mon, 15 May 2023 12:19:00 +0200
Date: Mon, 15 May 2023 12:19:00 +0200
From: Jakub Jelinek <jakub@redhat.com>
To: Frederik Harwath <frederik@codesourcery.com>
Cc: gcc-patches@gcc.gnu.org, fortran@gcc.gnu.org, tobias@codesourcery.com,
        joseph@codesourcery.com, jason@redhat.com
Subject: Re: [PATCH 0/7] openmp: OpenMP 5.1 loop transformation directives
Message-ID: <ZGIHFEMt6Tky2eq/@tucnak>
Reply-To: Jakub Jelinek <jakub@redhat.com>
References: <20230324153046.3996092-1-frederik@codesourcery.com>
MIME-Version: 1.0
In-Reply-To: <20230324153046.3996092-1-frederik@codesourcery.com>
X-Scanned-By: MIMEDefang 3.1 on 10.11.54.5
X-Mimecast-Spam-Score: 0
X-Mimecast-Originator: redhat.com
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
X-Spam-Status: No, score=-2.6 required=5.0 tests=BAYES_00,DKIMWL_WL_HIGH,DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE,RCVD_IN_MSPIKE_H2,SPF_HELO_NONE,SPF_NONE,TXREP,T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org
List-Id: <fortran.gcc.gnu.org>

On Fri, Mar 24, 2023 at 04:30:38PM +0100, Frederik Harwath wrote:
> this patch series implements the OpenMP 5.1 "unroll" and "tile"
> constructs.  It includes changes to the C,C++, and Fortran front end
> for parsing the new constructs and a new middle-end
> "omp_transform_loops" pass which implements the transformations in a
> source language agnostic way.

I'm afraid we can't do it this way, at least not completely.

The OpenMP requirements and what is being discussed for further loop
transformations pretty much requires parts of it to be done as soon as possible.
My understanding is that that is where other implementations implement that
too and would also prefer GCC not to be the only implementation that takes
significantly different decision in that case from other implementations
like e.g. in the offloading case (where all other implementations
preprocess/parse etc. source multiple times compared to GCC splitting stuff
only at IPA time; this affects what can be done with metadirectives,
declare variant etc.).
Now, e.g. data sharing is done almost exclusively during gimplification,
the proposed pass is later than that; it needs to be done before the data
sharing.  Ditto doacross handling.
The normal loop constructs (OMP_FOR, OMP_SIMD, OMP_DISTRIBUTE, OMP_LOOP)
already need to know given their collapse/ordered how many loops they are
actually associated with and the loop transformation constructs can change
that.
So, I think we need to do the loop transformations in the FEs, that doesn't
mean we need to write everything 3 times, once for each frontend.
Already now, e.g. various stuff is shared between C and C++ FEs in c-family,
though how much can be shared between c-family and Fortran is to be
discovered.

Or at least partially, to the extent that we compute how many canonical
loops the loop transformations result in, what artificial iterators they
will use etc., so that during gimplification we can take all that into
account and then can do the actual transformations later.

For C, I think the lowering of loop transformation constructs or at least
determining what it means can be done right after we actually parse it and
before we finalize the OMP_FOR eetc. that wraps it if any.  As discussed last
week at F2F, I think we want to remember in OMP_FOR_ORIG_DECLS the user
iterators on the loop transformation constructs and take it into account
for data sharing purposes.

For C++ in templates we obviously need to defer that until instantiations,
the constants in the clauses etc. could be template parameters etc.

For Fortran during resolving.

>  The "unroll" and "tile" directives are
> internally implemented as clauses.  This fits the representation of

So perhaps just use OMP_UNROLL/OMP_TILE as GENERIC constructs like
OMP_FOR etc. but with some argument where from the early loop
transformation analysis you can remember the important stuff, whether
does the loop transformation result in a canonical loop nest or not
and in the former case with how many nested loops.

And then handle the actual transformation IMHO best at gimplification
time, find them in the OMP_FOR etc. body if they are nested in there,
let the transformation happen on GENERIC before the containing OMP_FOR
etc. if any is actually finalized and from the transformation remember
the original user decls and what should happen with them for data sharing
(e.g. lastprivate/lastprivate conditional).
>From the slides I saw last week, a lot of other transformations are in the
planning, like loop reversal etc.
And, I think even in OpenMP 5.1 nothing prevents e.g.
#pragma omp for collapse(3) // etc.
#pragma omp tile sizes (4, 2, 2)
#pragma omp tile sizes (4, 8, 16)
for (int i = 0; i < 64; ++i)
  for (int j = 0; j < 64; ++j)
    for (int k = 0; k < 64; ++k)
      body;
where the inner tile takes the i and j loops and makes
for (int i1 = 0; i1 < 64; i1 += 4)
  for (int j1 = 0; j1 < 64; j1 += 8)
    for (int k1 = 0; k1 < 64; k1 += 16)
      for (int i2 = 0; i2 < 4; i2++)
	{
	  int i = i1 + i2;
	  for (int j2 = 0; j2 < 8; j2++)
	    {
	      int j = j1 + j2;
	      for (int k2 = 0; k2 < 16; k2++)
		{
		  int k = k1 + k2;
		  body;
		}
	    }
	}
out of it with 3 outer loops which have canonical loop form (the rest
doesn't).  And then the second tile takes the outermost 3 of those generated
loops and tiles them again, making it into again 3 canonical loop form
loops plus stuff inside of it.
Or one can replace the
#pragma omp for collapse(3) // etc.
with
#pragma omp for
#pragma omp unroll partial(2)
which furthermore unrolls the outermost generated loop from the outer tile
turning it into 1 canonical loop form loop plus stuff in it.
Or of course as you have in your testcases, some loop transformation
constructs could be used on more nested loops, not necessarily before
the outermost one.  But still, in all cases you need to know quite early
how many canonical loop form nested loops you get from each loop
transformation, so that it can be e.g. checked against the collapse/ordered
clauses.

Feel free to disagree if you think your approach is able to handle all of
this, just put details in why do you think so.

	Jakub