From: Matthias Kretz
To: Matthias Kretz, Srinivas Yadav
Subject: Re: [PATCH] libstdc++: add ARM SVE support to std::experimental::simd
Date: Wed, 27 Mar 2024 12:53:00 +0100
Message-ID: <4306399.Mh6RI2rZIc@excalibur>
References: <3296207.VqM8IeB0Os@centauriprime>

Hi Richard,

sorry for not answering sooner. I took action on your mail but failed to
also give feedback. Now, in light of your veto of Srinivas' patch, I
wanted to use the opportunity to pick this up again.

On Tuesday, 23 January 2024 21:57:23 CET Richard Sandiford wrote:
> However, we also support different vector lengths for streaming SVE
> (running in "streaming" mode on SME) and non-streaming SVE (running
> in "non-streaming" mode on the core). Having two different lengths is
> expected to be the common case, rather than a theoretical curiosity.

I read up on this after you mentioned it for the first time. As a WG21
member I find the approach troublesome - but that's a bit off-topic for
this thread.

The big issue here is that, IIUC, a user (and the simd library) cannot do
the right thing at the moment. There simply isn't enough context
information available when parsing the header. I.e.
on definition of the class template there's no facility to take
target_clones or SME "streaming" mode into account. Consequently, if we
want the library to be fit for SME, then we need more language
extension(s) to make it work.

I guess I'm looking for a way to declare types that are different
depending on whether they are used in streaming mode or non-streaming
mode (making them ill-formed to use in functions marked
arm_streaming_compatible).

From reading through
https://arm-software.github.io/acle/main/acle.html#controlling-the-use-of-streaming-mode
I don't see any discussion of member functions or ctor/dtor, static and
non-static data members, etc.

The big issue I see here is that currently all of std::* is declared
without arm_streaming or arm_streaming_compatible. Thus, IIUC, you can't
use anything from the standard library in streaming mode. Since that also
applies to std::experimental::simd, we're not creating a new footgun,
only missing out on potential users?

Some more thoughts on target_clones/streaming SVE language extension
evolution:

  void nonstreaming_fn(void) {
    constexpr int width = __arm_sve_bits(); // e.g. 512
    constexpr int width2 = __builtin_vector_size(); // e.g. 64 (the
      // vector_size attribute works with bytes, not bits)
  }

  __attribute__((arm_locally_streaming))
  void streaming_fn(void) {
    constexpr int width = __arm_sve_bits(); // e.g. 128
    constexpr int width2 = __builtin_vector_size(); // e.g. 16
  }

  __attribute__((target_clones("sse4.2,avx2")))
  void streaming_fn(void) {
    constexpr int width = __builtin_vector_size(); // 16 in the sse4.2 clone
                                                   // and 32 in the avx2 clone
  }

... as a starting point for exploration. Given this, I'd still have to
resort to a macro to define a "native" simd type:

  #define NATIVE_SIMD(T) std::experimental::simd>

Getting rid of the macro seems to be even harder.
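
For contrast, what works today is a translation-unit-wide constant derived
from predefined macros. The following is a minimal sketch, not libstdc++
code: the names tu_native_vector_bytes, width_in_plain_fn and
width_in_other_fn are hypothetical, and the fallback value is an arbitrary
assumption. Only the feature-test macros are the ones GCC and Clang
predefine. The value is fixed when the header is parsed, so every function
in the TU sees the same width regardless of the attributes it carries,
which is the missing context described above.

```cpp
#include <cstddef>

// Hypothetical helper: widest vector register, in bytes, that this
// translation unit was compiled for. Decided once, at preprocessing time.
#if defined(__ARM_FEATURE_SVE_BITS) && __ARM_FEATURE_SVE_BITS > 0
constexpr std::size_t tu_native_vector_bytes = __ARM_FEATURE_SVE_BITS / 8;
#elif defined(__AVX512F__)
constexpr std::size_t tu_native_vector_bytes = 64;
#elif defined(__AVX__)
constexpr std::size_t tu_native_vector_bytes = 32;
#elif defined(__SSE2__) || defined(__ARM_NEON)
constexpr std::size_t tu_native_vector_bytes = 16;
#else
constexpr std::size_t tu_native_vector_bytes = 8; // assumed scalar fallback
#endif

// Both functions read the same constant. Marking one of them with
// target_clones or a streaming attribute would not change what it sees,
// because the constant was computed before any function was parsed.
std::size_t width_in_plain_fn() { return tu_native_vector_bytes; }
std::size_t width_in_other_fn() { return tu_native_vector_bytes; }
```

A per-function answer would need a compiler builtin evaluated in the
context of the enclosing function, which is what the hypothetical
__arm_sve_bits() and __builtin_vector_size() above stand for.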
A declaration of an alias like

  template using SveSimd = std::experimental::simd>;

would have to delay "invoking" __arm_sve_bits() until it knows its
context:

  void nonstreaming_fn(void) {
    static_assert(sizeof(SveSimd) == 64);
  }

  __attribute__((arm_locally_streaming))
  void streaming_fn(void) {
    static_assert(sizeof(SveSimd) == 16);
    nonstreaming_fn(); // fine
  }

This gets even worse for target_clones, where

  void f() {
    sizeof(std::simd) == ?
  }

  __attribute__((target_clones("sse4.2,avx2")))
  void g() {
    f();
  }

the compiler *must* virally apply target_clones to all functions it
calls. And member functions must either also get cloned as functions, or
the whole type must be cloned (as in the std::simd case, where the sizeof
needs to change). 😳

> When would NumberOfUsedBytes < SizeofRegister be used for SVE? Would it
> be for storing narrower elements in wider containers? If the interface
> supports that then, yeah, two parameters would probably be safer.
>
> Or were you thinking about emulating narrower vectors with wider registers
> using predication? I suppose that's possible too, and would be similar in
> spirit to using SVE to optimise Advanced SIMD std::simd types.
> But mightn't it cause confusion if sizeof applied to a "16-byte"
> vector actually gives 32?

Yes, the idea is to e.g. use one SVE register instead of two NEON
registers for a "float, 8" with SVE512.

The user never asks for a "16-byte" vector. The user asks for a
value-type and a number of elements. Sure, the wasteful "padding" might
come as a surprise, but it's totally within the spec to implement it like
this.

> I assume std::experimental::native_simd has to have the same
> meaning everywhere for ODR reasons?

No. Only std::experimental::simd has to be "ABI stable". And note that in
the C++ spec there's no such thing as compiling and linking TUs with
different compiler flags. That's plain UB.
The committee still cares about it, but
getting this "right" cannot be part of the standard and must be defined
by implementers.

> If so, it'll need to be an
> Advanced SIMD vector for AArch64 (but using SVE to optimise certain
> operations under the hood where available). I don't think we could
> support anything else.

simd on AArch64 uses [[gnu::vector_size(16)]].

> Even if SVE registers are larger than 128 bits, we can't require
> all code in the binary to be recompiled with that knowledge.
>
> I suppose that breaks the "largest" requirement, but due to the
> streaming/non-streaming distinction I mentioned above, there isn't
> really a single, universal "largest" in this context.

There is, but it's context-dependent. I'd love to make this work.

> SVE and Advanced SIMD are architected to use the same registers
> (i.e. SVE registers architecturally extend Advanced SIMD registers).
> In Neoverse V1 (SVE256) they are the same physical register as well.
> I believe the same is true for A64FX.

That's good to know. 👍

> FWIW, GCC has already started using SVE in this way. E.g. SVE provides
> a wider range of immediate constants for logic operations, so we now use
> them for Advanced SIMD logic where beneficial.

I will consider these optimizations (when necessary in the library) for
the C++26 implementation.
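
The earlier point about padding (a "float, 8" held in one SVE512 register)
can be illustrated without any SVE hardware. This is a minimal sketch
under stated assumptions: padded_vec and float8_in_sve512 are hypothetical
names for illustration and do not reflect how libstdc++ lays out its simd
types. The storage always spans one full register, so sizeof reports the
register size rather than the number of bytes the user's elements occupy.

```cpp
#include <cstddef>

// Hypothetical illustration only. Stores N elements of T in a buffer the
// size of one full register of RegisterBytes bytes (a power of two), the
// way a partially used SVE register would hold them.
template <typename T, std::size_t N, std::size_t RegisterBytes>
struct padded_vec {
  static_assert(N * sizeof(T) <= RegisterBytes,
                "elements must fit in one register");

  // What the user asked for: a value type and a number of elements.
  static constexpr std::size_t size() { return N; }

  // Bytes actually occupied by the N requested elements.
  static constexpr std::size_t used_bytes() { return N * sizeof(T); }

  // Full-register storage; lanes beyond N are padding.
  alignas(RegisterBytes) T data[RegisterBytes / sizeof(T)];
};

// "float, 8" in one 64-byte (SVE512-sized) register.
using float8_in_sve512 = padded_vec<float, 8, 64>;

static_assert(float8_in_sve512::size() == 8, "eight elements requested");
static_assert(float8_in_sve512::used_bytes() == 8 * sizeof(float),
              "only the requested elements count as used");
static_assert(sizeof(float8_in_sve512) == 64,
              "sizeof reports the whole register, padding included");
```

In a real implementation the unused lanes would be masked off by
predication; the sketch only shows why sizeof and the element count can
legitimately disagree.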

Best,
Matthias

-- 
──────────────────────────────────────────────────────────────────────────
 Dr. Matthias Kretz                           https://mattkretz.github.io
 GSI Helmholtz Center for Heavy Ion Research               https://gsi.de
 std::simd
──────────────────────────────────────────────────────────────────────────