From: Matthias Kretz
To: Matthias Kretz, Srinivas Yadav
Subject: Re: [PATCH] libstdc++: add ARM SVE support to std::experimental::simd
Date: Thu, 28 Mar 2024 15:48:21 +0100
Message-ID: <25137674.EfDdHjke4D@excalibur>
Organization: GSI Helmholtz Center for Heavy Ion Research
References: <4306399.Mh6RI2rZIc@excalibur>

On Wednesday, 27 March 2024 14:34:52 CET Richard Sandiford wrote:
> Matthias Kretz writes:
> > The big issue here is that, IIUC, a user (and the simd library) cannot do
> > the right thing at the moment. There simply isn't enough context
> > information available when parsing the header. I.e.
> > on definition of the class template there's no facility to take
> > target_clones or SME "streaming" mode into account. Consequently, if we
> > want the library to be fit for SME, then we need more language
> > extension(s) to make it work.
>
> Yeah. I think the same applies to plain SVE.

With "plain SVE" you mean the *scalable* part of it, right?

BTW, I've experimented with implementing simd basically as

template <typename T, int N>
  class simd
  {
    alignas(bit_ceil(sizeof(T) * N)) T data[N];

See here: https://compiler-explorer.com/z/WW6KqanTW

Maybe the compiler can get better at optimizing this approach. But for now
it's not a solution for a *scalable* variant, because all such code is going
to be load/store bound from the get-go.
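
A minimal sketch of that array-backed approach, for readers who don't want
to open the Compiler Explorer link. It is not the code behind the link and
not the libstdc++ implementation. The names array_simd and next_pow2 are
hypothetical; next_pow2 stands in for C++20 std::bit_ceil so that the sketch
also compiles as C++17, and the choice of `int N` is an assumption.

```cpp
#include <cstddef>  // std::size_t

// Hypothetical helper: smallest power of two >= x (stand-in for std::bit_ceil).
constexpr std::size_t next_pow2(std::size_t x)
{
  std::size_t p = 1;
  while (p < x)
    p *= 2;
  return p;
}

// Hypothetical illustration type, not std::experimental::simd.
template <typename T, int N>
class array_simd
{
  // Over-align the array to the next power of two of its byte size, so the
  // compiler is free to use full-width vector loads and stores on it.
  alignas(next_pow2(sizeof(T) * N)) T data[N];

public:
  constexpr array_simd() : data{} {}

  // Broadcast a single value to all N elements.
  constexpr explicit array_simd(T x) : data{}
  {
    for (int i = 0; i < N; ++i)
      data[i] = x;
  }

  constexpr T operator[](int i) const { return data[i]; }

  // Element-wise addition written as a plain loop; whether it becomes a
  // single vector instruction is entirely up to the optimizer.
  friend constexpr array_simd operator+(const array_simd& a, const array_simd& b)
  {
    array_simd r;
    for (int i = 0; i < N; ++i)
      r.data[i] = a.data[i] + b.data[i];
    return r;
  }
};
```

Because the object is an array in memory rather than a vector register type,
every operation is, semantically, load / compute / store, and it is left to
the compiler to keep values in registers across operations.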

@Srinivas: See the guard variables for __index0123? They need to go. I
believe you can and should declare them `constexpr`.

> It seems reasonable to
> have functions whose implementation is specialised for a specific SVE
> length, with that function being selected at runtime where appropriate.
> Those functions needn't (in principle) be in separate TUs. The “best”
> definition of native then becomes a per-function property rather
> than a per-TU property.

Hmm, I never considered this; but can one actually write fixed-length SVE
code if -msve-vector-bits is not given? Then it's certainly possible to write
a single TU with a runtime dispatch for all different SVE-widths. (This is
less interesting on x86 where we need to dispatch on ISA extensions *and*
vector width. It's much simpler (and safer) to compile a TU multiple times,
restricted to a certain set of ISA extensions, and then dispatch to the right
translation from some general code section.)

> As you note later, I think the same thing would apply to x86_64.

Yes. I don't think "same" is the case (yet) but it's very similar. Once ARM
is at SVE9 😉 and binaries need to support HW from SVE2 up to SVE9 it gets
closer to "same".

> > The big issue I see here is that currently all of std::* is declared
> > without a arm_streaming or arm_streaming_compatible. Thus, IIUC, you
> > can't use anything from the standard library in streaming mode. Since
> > that also applies to std::experimental::simd, we're not creating a new
> > footgun, only missing out on potential users?
>
> Kind-of. However, we can inline a non-streaming function into a streaming
> function if that doesn't change defined behaviour. And that's important
> in practice for C++, since most trivial inline functions will not be
> marked streaming-compatible despite being so in practice.

Ah good to know that it takes a pragmatic approach here.
But I imagine this could become a source of confusion to users.

> > [...]
> > the compiler *must* virally apply target_clones to all functions it calls.
> > And member functions must either also get cloned as functions, or the
> > whole type must be cloned (as in the std::simd case, where the sizeof
> > needs to change). 😳
>
> Yeah, tricky :)
>
> It's also not just about vector widths. The target-clones case also has
> the problem that you cannot detect at include time which features are
> available. E.g. “do I have SVE2-specific instructions?” becomes a
> contextual question rather than a global question.
>
> Fortunately, this should just be a missed optimisation. But it would be
> nice if uses of std::simd in SVE2 clones could take advantage of SVE2-only
> instructions, even if SVE2 wasn't enabled at include time.

Exactly. Even if we solve the scalable vector-length question, the
target_clones question stays relevant.

So far my best answer, for x86 at least, is to compile the SIMD code multiple
times into different shared libraries. And then let the dynamic linker pick
the right library variant depending on the CPU. I'd be happy to have
something simpler and working right out of the box.

Best,
Matthias

-- 
──────────────────────────────────────────────────────────────────────────
 Dr. Matthias Kretz                           https://mattkretz.github.io
 GSI Helmholtz Center for Heavy Ion Research               https://gsi.de
 std::simd
──────────────────────────────────────────────────────────────────────────