From: Matthias Kretz
To: Srinivas Yadav
Subject: Re: [PATCH] libstdc++: add ARM SVE support to std::experimental::simd
Date: Thu, 18 Jan 2024 07:54:32 +0100
Message-ID: <3296207.VqM8IeB0Os@centauriprime>
Organization: GSI Helmholtz Centre for Heavy Ion Research

On Sunday, 10 December 2023 14:29:45 CET Richard Sandiford wrote:
> Thanks for the patch and sorry for the slow review.

Sorry for my slow reaction. I needed a long vacation. For now I'll focus on
the design question wrt. multi-arch compilation.

> I can only comment on the usage of SVE, rather than on the scaffolding
> around it.  Hopefully Jonathan or others can comment on the rest.

That's very useful!

> The main thing that worries me is:
>
>   #if _GLIBCXX_SIMD_HAVE_SVE
>   constexpr inline int __sve_vectorized_size_bytes = __ARM_FEATURE_SVE_BITS/8;
>   #else
>   constexpr inline int __sve_vectorized_size_bytes = 0;
>   #endif
>
> Although -msve-vector-bits is currently a per-TU setting, that isn't
> necessarily going to be the case in future.

This is a topic that I care about a lot... as simd user, implementer, and WG21
proposal author. Are you thinking of a plan to implement the target_clones
function attribute for different SVE lengths?
Or does it go further than that?
PR83875 is raising the same issue and solution ideas for x86. If I understand
your concern correctly, then the issue you're raising exists in the same form
for x86.

If anyone is interested in working on a "translation phase 7 replacement" for
compiler flags macros I'd be happy to give some input of what I believe is
necessary to make target_clones work with std(x)::simd. This seems to be about
constant expressions that return compiler-internal state - probably similar to
how static reflection needs to work.

For a sketch of a direction: what I'm already doing in
std::experimental::simd, is to tag all non-always_inline function names with a
bitflag, representing a relevant subset of -f and -m flags. That way, I'm
guarding against surprises on linking TUs compiled with different flags.

> Ideally it would be
> possible to define different implementations of a function with
> different (fixed) vector lengths within the same TU.  The value at
> the point that the header file is included is then somewhat arbitrary.
>
> So rather than have:
> > using __neon128 = _Neon<16>;
> > using __neon64 = _Neon<8>;
> >
> > +using __sve = _Sve<>;
>
> would it be possible instead to have:
>
>   using __sve128 = _Sve<128>;
>   using __sve256 = _Sve<256>;
>   ...etc...
>
> ?  Code specialised for 128-bit vectors could then use __sve128 and
> code specialised for 256-bit vectors could use __sve256.

Hmm, as things stand we'd need two numbers, IIUC:
_Sve

On x86, "NumberOfUsedBytes" is sufficient, because 33-64 implies zmm registers
(and -mavx512f), 17-32 implies ymm, and <=16 implies xmm (except where it
doesn't ;) ).

> Perhaps that's not possible as things stand, but it'd be interesting
> to know the exact failure mode if so.
> Either way, it would be good to
> remove direct uses of __ARM_FEATURE_SVE_BITS from simd_sve.h if possible,
> and instead rely on information derived from template parameters.

The TS spec requires std::experimental::native_simd<T> to basically give you
the largest, most efficient, full SIMD register. (And it can't be a sizeless
type because they don't exist in C++). So how would you do that without
looking at __ARM_FEATURE_SVE_BITS in the simd implementation?

> It should be possible to use SVE to optimise some of the __neon*
> implementations, which has the advantage of working even for VLA SVE.
> That's obviously a separate patch, though.  Just saying for the record.

I learned that NVidia Grace CPUs alias NEON and SVE registers. But I must
assume that other SVE implementations (especially those with
__ARM_FEATURE_SVE_BITS > 128) don't do that and might incur a significant
latency when going from a NEON register to an SVE register and back (which
each requires a store-load, IIUC). So are you thinking of implementing
everything via SVE? That would break ABI, no?

- Matthias

-- 
──────────────────────────────────────────────────────────────────────────
 Dr. Matthias Kretz                            https://mattkretz.github.io
 GSI Helmholtz Center for Heavy Ion Research                https://gsi.de
 std::simd
──────────────────────────────────────────────────────────────────────────
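
A minimal sketch of the flag-tagging idea mentioned in the message (encoding a
subset of -m flags in the names of functions that may not get inlined). This is
not the libstdc++ code: machine_flags, FlagsTag, DefaultTag, sum, sample and the
bit positions are made up for illustration; only the feature macros are real
compiler predefines.

```cpp
#include <cstdint>
#include <type_traits>

// Bitmask of target features visible in this TU. The macros are the usual
// GCC/Clang predefines; the bit positions are arbitrary (assumption).
constexpr std::uint64_t machine_flags = 0
#ifdef __SSE2__
  | (std::uint64_t(1) << 0)
#endif
#ifdef __AVX2__
  | (std::uint64_t(1) << 1)
#endif
#ifdef __AVX512F__
  | (std::uint64_t(1) << 2)
#endif
#ifdef __ARM_NEON
  | (std::uint64_t(1) << 3)
#endif
#ifdef __ARM_FEATURE_SVE
  | (std::uint64_t(1) << 4)
#endif
  ;

// Empty tag type; its template argument ends up in the mangled name of
// every function template specialization that uses it.
template <std::uint64_t Flags>
struct FlagsTag {};

using DefaultTag = FlagsTag<machine_flags>;

// Hypothetical library function. The defaulted Tag parameter makes
// sum<int> from a TU built with -mavx2 a different symbol than sum<int>
// from a TU built without it.
template <typename T, typename Tag = DefaultTag>
constexpr T sum(const T* data, int n)
{
  T acc{};
  for (int i = 0; i < n; ++i)
    acc += data[i];
  return acc;
}

// Different flag sets give different tag types, hence different symbols.
static_assert(!std::is_same<FlagsTag<0>, FlagsTag<1>>::value,
              "different flag sets must give different types");

// Sample data for trying out sum().
constexpr int sample[4] = {1, 2, 3, 4};
```

A caller simply writes sum(sample, 4); the tag is picked up from the flags of
the TU that instantiates the call, so two TUs built with different -m flags
never share one function body at link time.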