From: Richard Sandiford
To: Matthias Kretz
Cc: Srinivas Yadav
Subject: Re: [PATCH] libstdc++: add ARM SVE support to std::experimental::simd
Date: Wed, 27 Mar 2024 13:34:52 +0000
References: <3296207.VqM8IeB0Os@centauriprime> <4306399.Mh6RI2rZIc@excalibur>
In-Reply-To: <4306399.Mh6RI2rZIc@excalibur> (Matthias Kretz's message of "Wed, 27 Mar 2024 12:53:00 +0100")

Matthias Kretz writes:
> Hi Richard,
>
> sorry for not answering sooner. I took action on your mail but failed to also
> give feedback. Now in light of your veto of Srinivas patch I wanted to use the
> opportunity to pick this up again.
>
> On Tuesday, 23 January 2024 21:57:23 CET Richard Sandiford wrote:
>> However, we also support different vector lengths for streaming SVE
>> (running in "streaming" mode on SME) and non-streaming SVE (running
>> in "non-streaming" mode on the core). Having two different lengths is
>> expected to be the common case, rather than a theoretical curiosity.
>
> I read up on this after you mentioned this for the first time. As a WG21
> member I find the approach troublesome - but that's a bit off-topic for this
> thread.
>
> The big issue here is that, IIUC, a user (and the simd library) cannot do the
> right thing at the moment. There simply isn't enough context information
> available when parsing the header. I.e. on definition of
> the class template there's no facility to take target_clones or SME
> "streaming" mode into account. Consequently, if we want the library to be fit
> for SME, then we need more language extension(s) to make it work.

Yeah. I think the same applies to plain SVE. It seems reasonable to
have functions whose implementation is specialised for a specific SVE
length, with that function being selected at runtime where appropriate.
Those functions needn't (in principle) be in separate TUs. The “best”
definition of native then becomes a per-function property rather than
a per-TU property. As you note later, I think the same thing would
apply to x86_64.

> I guess I'm looking for a way to declare types that are different depending on
> whether they are used in streaming mode or non-streaming mode (making them
> ill-formed to use in functions marked arm_streaming_compatible).
>
> From reading through
> https://arm-software.github.io/acle/main/acle.html#controlling-the-use-of-streaming-mode
> I don't see any discussion of
> member functions or ctor/dtor, static and non-static data members, etc.
>
> The big issue I see here is that currently all of std::* is declared without a
> arm_streaming or arm_streaming_compatible. Thus, IIUC, you can't use anything
> from the standard library in streaming mode. Since that also applies to
> std::experimental::simd, we're not creating a new footgun, only missing out on
> potential users?

Kind-of. However, we can inline a non-streaming function into a
streaming function if that doesn't change defined behaviour. And that's
important in practice for C++, since most trivial inline functions will
not be marked streaming-compatible despite being so in practice.

It's UB to pass and return SVE vectors across streaming/non-streaming
boundaries unless the two VLs are equal. It's therefore valid to inline
such functions into streaming functions *unless* the callee uses
non-streaming-only instructions such as gather loads.

Because of that, someone trying to use std::experimental::simd in SME
functions is likely to succeed, at least in simple cases.

> Some more thoughts on target_clones/streaming SVE language extension
> evolution:
>
>   void nonstreaming_fn(void) {
>     constexpr int width = __arm_sve_bits(); // e.g. 512
>     constexpr int width2 = __builtin_vector_size(); // e.g. 64 (the
>       // vector_size attribute works with bytes, not bits)
>   }
>
>   __attribute__((arm_locally_streaming))
>   void streaming_fn(void) {
>     constexpr int width = __arm_sve_bits(); // e.g. 128
>     constexpr int width2 = __builtin_vector_size(); // e.g. 16
>   }
>
>   __attribute__((target_clones("sse4.2,avx2")))
>   void streaming_fn(void) {
>     constexpr int width = __builtin_vector_size(); // 16 in the sse4.2 clone
>                                                    // and 32 in the avx2 clone
>   }
>
> ... as a starting point for exploration. Given this, I'd still have to resort
> to a macro to define a "native" simd type:
>
>   #define NATIVE_SIMD(T) std::experimental::simd CHAR_BITS, __arm_sve_bits() / CHAR_BITS>>
>
> Getting rid of the macro seems to be even harder.

Yeah. The constexprs in the AArch64 functions would only be compile-time
constants in to-be-defined circumstances, using some mechanism to specify
the streaming and non-streaming vector lengths at compile time. But that's
a premise of the whole discussion, just noting it for the record in case
anyone reading this later jumps in at this point.

> A declaration of an alias like
>
>   template
>   using SveSimd = std::experimental::simd CHAR_BITS, __arm_sve_bits() / CHAR_BITS>>;
>
> would have to delay "invoking" __arm_sve_bits() until it knows its context:
>
>   void nonstreaming_fn(void) {
>     static_assert(sizeof(SveSimd) == 64);
>   }
>
>   __attribute__((arm_locally_streaming))
>   void streaming_fn(void) {
>     static_assert(sizeof(SveSimd) == 16);
>     nonstreaming_fn(); // fine
>   }
>
> This gets even worse for target_clones, where
>
>   void f() {
>     sizeof(std::simd) == ?
>   }
>
>   __attribute__((target_clones("sse4.2,avx2")))
>   void g() {
>     f();
>   }
>
> the compiler *must* virally apply target_clones to all functions it calls. And
> member functions must either also get cloned as functions, or the whole type
> must be cloned (as in the std::simd case, where the sizeof needs to change).
> 😳

Yeah, tricky :)

It's also not just about vector widths. The target-clones case also has
the problem that you cannot detect at include time which features are
available. E.g. “do I have SVE2-specific instructions?” becomes a
contextual question rather than a global question.

Fortunately, this should just be a missed optimisation. But it would be
nice if uses of std::simd in SVE2 clones could take advantage of SVE2-only
instructions, even if SVE2 wasn't enabled at include time.

Thanks for the other (snipped) clarifications. They were really helpful,
but I didn't have anything to add.

Richard
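
For illustration, here is a minimal sketch of the per-function idea from
the thread: several implementations of one operation, each specialised for
a vector length, with one of them selected at runtime, all in a single TU.
It is plain portable C++ and uses no SVE, SME or target_clones support.
FixedVec, sum_impl, sum_dispatch and the vector_bytes parameter are
hypothetical names invented for this sketch, and the runtime width is
assumed to come from some target-specific query that is not shown.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a fixed-width simd type. Bytes plays the role
// of the vector length that a real implementation would get from the target.
template <typename T, std::size_t Bytes>
struct FixedVec {
  static_assert(Bytes % sizeof(T) == 0, "width must be a multiple of sizeof(T)");
  static constexpr std::size_t lanes = Bytes / sizeof(T);
  std::array<T, lanes> v{};  // zero-initialised lanes
};

// One implementation per vector length: "native" is a property of each
// instantiation rather than of the translation unit.
template <std::size_t Bytes>
int sum_impl(const int* data, std::size_t n) {
  using Vec = FixedVec<int, Bytes>;
  Vec acc;  // per-lane partial sums
  std::size_t i = 0;
  for (; i + Vec::lanes <= n; i += Vec::lanes)
    for (std::size_t l = 0; l < Vec::lanes; ++l)
      acc.v[l] += data[i + l];
  int total = 0;
  for (int x : acc.v) total += x;       // horizontal reduction
  for (; i < n; ++i) total += data[i];  // scalar tail
  return total;
}

// Runtime selection between the specialisations, all in the same TU.
// Assumption: vector_bytes is supplied by a target-specific query.
int sum_dispatch(const int* data, std::size_t n, std::size_t vector_bytes) {
  assert(data != nullptr || n == 0);
  switch (vector_bytes) {
    case 64: return sum_impl<64>(data, n);
    case 32: return sum_impl<32>(data, n);
    default: return sum_impl<16>(data, n);
  }
}
```

Here the width is an explicit template argument, so each specialisation
sees its own sizeof. The open question in the thread is how a type such as
std::simd could pick up that width from the enclosing function's context
without the user spelling it out.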