Re: adding attributes - Bernhard Reutner-Fischer

public inbox for fortran@gcc.gnu.org
 help / color / mirror / Atom feed

From: Bernhard Reutner-Fischer <rep.dot.nop@gmail.com>
To: Dave Love via Fortran <fortran@gcc.gnu.org>
Cc: rep.dot.nop@gmail.com, Dave Love <dave.love@manchester.ac.uk>
Subject: Re: adding attributes
Date: Thu, 3 Nov 2022 00:19:26 +0100	[thread overview]
Message-ID: <20221103001926.725fd9bf@nbbrfq> (raw)
In-Reply-To: <87edund73d.fsf@manchester.ac.uk>

On Mon, 31 Oct 2022 21:19:18 +0000
Dave Love via Fortran <fortran@gcc.gnu.org> wrote:

> Bernhard Reutner-Fischer via Fortran <fortran@gcc.gnu.org> writes:

> > Ideally the syntax would be the same as in C.  
> 
> Right.  I hoped it would be possible to lift machinery easily from C.

Lifting that won't work easily, no.

> There's no standard method for this sort of portable performance
> engineering as far as I can tell.  The best I could see was specifying a
> SIMD length statically in OpenMP.  I'm interested in things that
> potentially make the difference between, say, vectorization for AVX2 or
> full-width AVX512 versus SSE2 for profiled host-spots.  I fully agree

I see.
So target_clones is one thing. What other attributes would be important?

> about measurement and not doing things blindly, and I prize
> maintainability.  However, target_clones is clearly better than the
> existing facility for explicit, target-independent unrolling, for instance.

Yes. Unroll is certainly only applicable in a few places, sure.
> 
> > In former times, you would compile your library multiple times
> > and provide a distinct, optimized version for each of the CPUs.
> > Maybe that would work for you equally well, without target_clones?  
> 
> "Former times" to me means, say, GEC 4000 v. IBM 370 and the aftermath
> of "all the world's a VAX", rather than different x86
> micro-architectures...  I do now work on both x86_64 and POWER.

In your job script you would use cpuid(1) to determine a properly tuned
binary for the parts of the cluster you run on. Or the installed
binaries are tuned for the host they are installed on and are located in
a uniform place per application.

> 
> Multiple compilation isn't a good solution.  I haven't followed the

It might not be good, but it's cheap and easy if you only have a small
set of different arches and subarches each. In a controlled
environment, with a batch scheduler. Won't work in the wild of course.

> current state of hardware capability support, but relevant systems don't
> have it on x86_64, at least.  That wouldn't help kernels of your
> simulation code that aren't abstracted into a library or set up for
> dynamic dispatch anyway.  I don't have a specific instance in mind, but
> consider OS packaging, which I do; that currently has to be built for
> base x86_64 (SSE2) for EPEL, at least, and so could miss a factor of
> several performance from vectorized.

For packaging for global use that won't work all that well indeed.

But since you cannot mix target_clones across arch-boundaries,
supporting those for a distro will probably be rather ugly anyway.
I think that's what's gentoo et al are for, or your privately rebuilt
debian repo; provide a tuned world for everybody, individually ;) But as
you mentioned EPEL i never said that :)

> 
> > HTH  
> 
> Thanks.  Definitely a more helpful response than when I asked about
> doing something previously!  (I don't know if I'll actually be able to
> work on it in the end, at least on work time.)

heh, me neither. Luckily yesterday was a holiday, so what i ended up
with was the following, fya.
Consider:
$ grep -v "^\!\!" /scratch/src/gcc-13.mine/gcc/testsuite/gfortran.dg/attr_target_clones-1.f90;echo EOF
! { dg-do compile }
! { dg-options "-O1 -fdump-tree-optimized" }
!
! Test __attribute__ ((target_clones ("foo", "bar")))
!
module m
  implicit none
contains
  subroutine sub1()
!GCC$ ATTRIBUTES target_clones("avx", "sse","default") :: sub1
    print *, 4321
  end
end module m
! { dg-final { scan-tree-dump-times {void * __m_MOD_sub1.resolver ()} "optimized" 1 } }
! { dg-final { scan-tree-dump-times {void __m_MOD_sub1.avx ()} "optimized" 1 } }
! { dg-final { scan-tree-dump-times {void __m_MOD_sub1.sse ()} "optimized" 1 } }
!!! { dg-final { scan-tree-dump-times {XXX something sub1.default ()} "optimized" 1 } }
! { dg-final { scan-tree-dump-not {void sub1 ()} "optimized" } }
EOF
Which gives
$ ./gfortran -B. -o /tmp/out.o -c /scratch/src/gcc-13.mine/gcc/testsuite/gfortran.dg/attr_target_clones-1.f90 -O2 -fdump-tree-original -fdump-tree-optimized
/tmp/ccxpGd9Y.s: Assembler messages:
/tmp/ccxpGd9Y.s:118: Error: symbol `__m_MOD_sub1' is already defined

That's because that ends up as
$ nl -ba /tmp/out.s | grep __m_MOD_sub1
    12		.type	__m_MOD_sub1, @function
    13	__m_MOD_sub1:
    35		.size	__m_MOD_sub1, .-__m_MOD_sub1
    36		.type	__m_MOD_sub1.avx, @function
    37	__m_MOD_sub1.avx:
    59		.size	__m_MOD_sub1.avx, .-__m_MOD_sub1.avx
    60		.type	__m_MOD_sub1.sse, @function
    61	__m_MOD_sub1.sse:
    83		.size	__m_MOD_sub1.sse, .-__m_MOD_sub1.sse
    84		.section	.text.__m_MOD_sub1.resolver,"axG",@progbits,__m_MOD_sub1.resolver,comdat
    85		.weak	__m_MOD_sub1.resolver
    86		.type	__m_MOD_sub1.resolver, @function
    87	__m_MOD_sub1.resolver:
    95		movl	$__m_MOD_sub1.avx, %eax
   104		movl	$__m_MOD_sub1, %eax
   105		movl	$__m_MOD_sub1.sse, %edx
   110		.size	__m_MOD_sub1.resolver, .-__m_MOD_sub1.resolver
   111		.globl	__m_MOD_sub1
   112		.type	__m_MOD_sub1, @gnu_indirect_function
   113		.set	__m_MOD_sub1,__m_MOD_sub1.resolver

where 13 and 111 probably don't work out too well.
The C frontend uses sub1.default as version for the (former) plain sub1:
     4		.type	sub1.default, @function
     5	sub1.default:
...
   103		.section	.text.sub1.resolver,"axG",@progbits,sub1.resolver,comdat
   105		.weak	sub1.resolver
   106		.type	sub1.resolver, @function
   107	sub1.resolver:
...
   162		leaq	sub1.default(%rip), %rax
   167		.size	sub1.resolver, .-sub1.resolver
   168		.globl	sub1
   169		.type	sub1, @gnu_indirect_function
   170		.set	sub1,sub1.resolver

If i mark the module fndecl as DECL_FUNCTION_VERSIONED, then it's
pointed out that i seem to have to provide the default by hand:
   10 |   subroutine sub1()
      |                 1
internal compiler error: in ix86_mangle_function_version_assembler_name, at config/i386/i386-features.cc:3165
0x806780 ix86_mangle_function_version_assembler_name

That's the check that there is
  /* target attribute string cannot be NULL.  */
  gcc_assert (version_attr != NULL_TREE);
So while target and target_clones seem to be mutually exclusive (from
the C FE checking), the versioning wants the default in a target attr
or something like that.

And on top of all that, gfc_match_gcc_attributes has the following
comment:
   TODO: We should support all GCC attributes using the same syntax for
   the attribute list, i.e. the list in C
      __attributes(( attribute-list ))
   matches then
      !GCC$ ATTRIBUTES attribute-list ::
   Cf. c-parser.cc's c_parser_attributes; the data can then directly be
   saved into a TREE.

When we do that, we can get rid of ext_attr_list[] because that would
be generated right from the start.

I've added a
   /* Attributes set by compiler extensions (!GCC$ ATTRIBUTES).  */
   unsigned ext_attr:EXT_ATTR_NUM;
+  tree ext_attr_args;

to struct symbol_attribute where i can prepare the tree_list for the
attrs right from the start. The lowering is then rather simple and
uniform, just chainon the prepared attributes and be done.

One could get rid of ext_attr altogether, with the caveat that this
would change the module format. We'd have to save the attrs in a
different way, breaking module compat again, of course.

target_clones does not require a bump in the module format, i'd say,
because the main entry point does not change. Will have to check if
the clones do not end up being emitted in the module, they shouldn't be.
Other attributes _may_ require a change in the module format though.
These would need checking on a per case basis.

That said, one cannot import all attributes handling from the C FE into
the fortran FE seamlessly. There is always a bit of massaging required.

next prev parent reply	other threads:[~2022-11-02 23:19 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-10-28 14:35 Dave Love
2022-10-30  7:48 ` Bernhard Reutner-Fischer
2022-10-31 21:19   ` Dave Love
2022-11-02 23:19     ` Bernhard Reutner-Fischer [this message]
2022-11-04 20:59       ` Bernhard Reutner-Fischer
2022-11-05  7:40         ` Thomas Koenig
2022-11-05 10:54           ` Bernhard Reutner-Fischer
2022-11-06 13:44             ` Thomas Koenig
2022-11-07 11:06               ` Dave Love
2023-02-24 12:24             ` Dave Love
2022-11-07 11:04       ` Dave Love
2022-11-10 12:25         ` Bernhard Reutner-Fischer

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20221103001926.725fd9bf@nbbrfq \
    --to=rep.dot.nop@gmail.com \
    --cc=dave.love@manchester.ac.uk \
    --cc=fortran@gcc.gnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).