Date: Tue, 16 May 2023 13:00:33 +0200
From: Jakub Jelinek
To: Frederik Harwath
Cc: gcc-patches@gcc.gnu.org, fortran@gcc.gnu.org, tobias@codesourcery.com, joseph@codesourcery.com, jason@redhat.com
Subject: Re: [PATCH 0/7] openmp: OpenMP 5.1 loop transformation directives
References: <20230324153046.3996092-1-frederik@codesourcery.com>

On Tue, May 16, 2023 at 11:45:16AM +0200, Frederik Harwath wrote:
> The place where different compilers implement the loop transformations
> was discussed in an OpenMP loop transformation meeting last year. Two
> compilers (another one and GCC with this patch series) transformed the
> loops in the middle end after the handling of data sharing; one planned
> to do so. Yet another vendor had not yet decided where it will be
> implemented. Clang currently does everything in the front end, but it
> was mentioned that this might change in the future, e.g. for code
> sharing with Flang.
> Implementing the loop transformations late could potentially
> complicate the implementation of transformations which require
> adjustments of the data sharing clauses, but this is known and,
> consequently, no such

When already in the FE we determine how many canonical loops a
particular loop transformation creates, I think the primary change I'd
like to see is really to have OMP_UNROLL/OMP_TILE GENERIC statements
(see below) and to consider where the best spot to lower them is.
I believe for data sharing it is best done during gimplification,
before the containing loops are handled: it is already shared code
among all the FEs, I think it will make it easier to handle data
sharing right, and gimplification is also where doacross processing is
done.

While there is a restriction that the ordered clause is incompatible
with generated loops from the tile construct, there isn't one for
unroll (unless "The ordered clause must not appear on a
worksharing-loop directive if the associated loops include the
generated loops of a tile directive." implicitly covers unroll partial
because a partial unroll tiles the loop; but it doesn't say it acts as
if it were a tile construct), so we'd have to handle

  #pragma omp for ordered(2)
  for (int i = 0; i < 64; i++)
    #pragma omp unroll partial(4)
    for (int j = 0; j < 64; j++)
      {
        #pragma omp ordered depend (sink: i - 1, j - 2)
        #pragma omp ordered depend (source)
      }

and I think handling that after gimplification is going to be
increasingly harder.  Of course, another possibility is to ask the
lang committee to clarify, unless it has been clarified already in 6.0
(in TR11 it is not).

Also, I think creating temporaries is easier done during
gimplification than later.

Another option is, as you implemented, a separate pre-omp-lowering
pass; yet another would be to do it in the omplower pass, which
actually has several subpasses internally, in the scan phase.
The disadvantage of a completely separate pass is that we have to walk
the whole IL again, while doing it in the scan phase means we avoid
that cost.  We already do similar transformations there: scan_omp_simd
transforms simd constructs into if (...) simd else simt and then we
process what we've created with the normal scan_omp_for.  So, if you
insist on doing it after gimplification, perhaps for compatibility
with other non-LLVM compilers, I'd prefer to do it there rather than
in a completely separate pass.

> transformations are planned for OpenMP 6.0. In particular, the
> "apply" clause therefore only permits loop-transforming constructs to
> be applied to the loops generated from other loop transformations in
> TR11.
>
> > The normal loop constructs (OMP_FOR, OMP_SIMD, OMP_DISTRIBUTE,
> > OMP_LOOP) already need to know, given their collapse/ordered, how
> > many loops they are actually associated with, and the loop
> > transformation constructs can change that.
> > So, I think we need to do the loop transformations in the FEs; that
> > doesn't mean we need to write everything 3 times, once for each
> > frontend.  Already now, e.g. various stuff is shared between the C
> > and C++ FEs in c-family, though how much can be shared between
> > c-family and Fortran is to be discovered.
> > Or at least partially, to the extent that we compute how many
> > canonical loops the loop transformations result in, what artificial
> > iterators they will use etc., so that during gimplification we can
> > take all that into account and then can do the actual
> > transformations later.
>
> The patches in this patch series already do compute how many canonical
> loop nests result from the loop transformations in the front end.

Good.

> This is necessary to represent the loop nest that is affected by the
> loop transformations by a single OMP_FOR to meet the expectations
> of all later OpenMP code transformations.
> This is also the major reason why the loop transformations are
> represented by clauses instead of representing them as
> "OMP_UNROLL/OMP_TILE as GENERIC constructs like OMP_FOR" as you
> suggest below. Since the

I really don't see why.  We try to represent the OpenMP constructs we
see in the source as those constructs.  We already have a precedent
with composite loop constructs, where for the combined constructs
which aren't innermost we temporarily use NULL
OMP_FOR_{INIT,COND,INCR,ORIG_DECLS} vectors to stand for "this will be
some loop, but the details for it aren't known yet, to be filled in
later".  So, why can't we similarly represent

  #pragma omp for collapse(3)
  #pragma omp tile sizes (4, 2, 2)
  #pragma omp tile sizes (4, 8, 16)
  for (int i = 0; i < 64; ++i)
    for (int j = 0; j < 64; ++j)
      for (int k = 0; k < 64; ++k)
        body;

as an OMP_FOR with NULL OMP_FOR_{INIT,COND,INCR,ORIG_DECLS} and the
appropriate clauses on it, containing an OMP_TILE (again with the
right clauses and NULL OMP_FOR_{INIT,COND,INCR,ORIG_DECLS}) and
another OMP_TILE, this time with all the vectors filled in, in
GENERIC?

  #pragma omp for collapse(2)
  for (int i = 0; i < 64; ++i)
    #pragma omp tile sizes (4)
    for (int j = 0; j < 64; ++j)

would be represented by non-NULL vectors which have all the inner
entries NULL (the outer loop is not a generated loop, the inner one
is), with an OMP_TILE inside of it.

Then, depending on where the loop transformation is actually
performed, we'd either need to preserve such a shape from
gimplification until the loop transformations are applied, or it would
be done solely on GENERIC and GIMPLE would already contain the
transformed loops.

Clauses e.g. have the disadvantage that they generally aren't ordered.
If it is separate statements, it is e.g. easier to print it right in
the original dump, so that people can compare the loops before the
transformation and after it.

> You suggest to implement the loop transformations during
> gimplification.
> I am not sure if gimplification is actually well-suited to implement
> the depth-first evaluation of the loop transformations. I also
> believe that

Why not?  The loop transformation constructs can't be deeply nested in
the bodies; they need to be close.  gimplify_omp_for already searches
the body for the case of composite constructs - see the
if (OMP_FOR_INIT (for_stmt) == NULL_TREE) check early in it.  So, this
would just mean that if that condition is true, it also looks for loop
transformation constructs (and if they are found, passes the
containing OMP_{FOR,SIMD,LOOP,DISTRIBUTE,TASKLOOP}, if any, to a
routine that handles the transformation, such that it can update the
containing looping construct, if any, during the transformation).
That alone would handle the case where the looping construct should
work solely with the generated loops.  It would need to do the same
thing also if OMP_FOR_INIT (for_stmt) is non-NULL but
TREE_VEC_ELT (OMP_FOR_INIT (for_stmt),
              TREE_VEC_LENGTH (OMP_FOR_INIT (for_stmt)) - 1)
is NULL, to handle the case where the generated loops are just some of
the inner ones.  And then, when gimplify_omp_for encountered an
OMP_TILE/OMP_UNROLL loop on its own (i.e. not nested inside some other
looping construct), it would similarly find further transformation
constructs in it like the above, but would then just normally perform
the loop transformation, with NULL_TREE for the containing loop,
meaning it is sequential stuff.

	Jakub