From: Richard Sandiford <richard.sandiford@arm.com>
To: Matthias Kretz
Cc: Srinivas Yadav
Subject: Re: [PATCH] libstdc++: add ARM SVE support to std::experimental::simd
References: <3296207.VqM8IeB0Os@centauriprime>
Date: Tue, 23 Jan 2024 20:57:23 +0000
In-Reply-To: <3296207.VqM8IeB0Os@centauriprime> (Matthias Kretz's
message of "Thu, 18 Jan 2024 07:54:32 +0100")

Matthias Kretz writes:
> On Sunday, 10 December 2023 14:29:45 CET Richard Sandiford wrote:
>> Thanks for the patch and sorry for the slow review.
>
> Sorry for my slow reaction. I needed a long vacation. For now I'll focus
> on the design question wrt. multi-arch compilation.
>
>> I can only comment on the usage of SVE, rather than on the scaffolding
>> around it. Hopefully Jonathan or others can comment on the rest.
>
> That's very useful!
>
>> The main thing that worries me is:
>>
>>   #if _GLIBCXX_SIMD_HAVE_SVE
>>   constexpr inline int __sve_vectorized_size_bytes = __ARM_FEATURE_SVE_BITS/8;
>>   #else
>>   constexpr inline int __sve_vectorized_size_bytes = 0;
>>   #endif
>>
>> Although -msve-vector-bits is currently a per-TU setting, that isn't
>> necessarily going to be the case in future.
>
> This is a topic that I care about a lot... as simd user, implementer, and
> WG21 proposal author. Are you thinking of a plan to implement the
> target_clones function attribute for different SVE lengths? Or does it go
> further than that? PR83875 is raising the same issue and solution ideas
> for x86. If I understand your concern correctly, then the issue you're
> raising exists in the same form for x86.
>
> If anyone is interested in working on a "translation phase 7 replacement"
> for compiler flags macros, I'd be happy to give some input on what I
> believe is necessary to make target_clones work with std(x)::simd.
> This seems to be about constant expressions that return compiler-internal
> state - probably similar to how static reflection needs to work.
>
> For a sketch of a direction: what I'm already doing in
> std::experimental::simd is to tag all non-always_inline function names
> with a bitflag, representing a relevant subset of -f and -m flags. That
> way, I'm guarding against surprises on linking TUs compiled with
> different flags.

Yeah, target_clones could be one use for switching lengths. In that case
the choice would be based on a load-time check of the vector length.

However, we also support different vector lengths for streaming SVE
(running in "streaming" mode on SME) and non-streaming SVE (running in
"non-streaming" mode on the core). Having two different lengths is
expected to be the common case, rather than a theoretical curiosity.

Software also provides a "streaming compatible" mode, in which code can
run either in streaming or non-streaming mode and which restricts itself
to the common subset of non-streaming and streaming features (which is
large).

So it isn't really safe to treat the vector length as a program or
process invariant, or to assume that all active uses of std::simd in a
binary are for the same vector length. It should in principle be possible
to use std::simd variants for non-streaming code and std::simd variants
for streaming code, even if the lengths are different.

So in the target_clones case, the clone would have to apply to a
non-streaming function and switch based on the non-streaming vector
length, or apply to a streaming function and switch based on the
streaming vector length. (Neither of those is supported yet, of course.)

I think there are probably also ODR problems with hard-coding a vector
length, instead of treating it as a template parameter. Unfortunately it
isn't currently possible to write:

  template<int N>
  using T = svint32_t __attribute__((arm_sve_vector_bits(N)));

due to PRs 58855/68703/88600. But I think we should fix that.
:)

>> Ideally it would be possible to define different implementations of a
>> function with different (fixed) vector lengths within the same TU. The
>> value at the point that the header file is included is then somewhat
>> arbitrary.
>>
>> So rather than have:
>>
>> > using __neon128 = _Neon<16>;
>> > using __neon64 = _Neon<8>;
>> >
>> > +using __sve = _Sve<>;
>>
>> would it be possible instead to have:
>>
>>   using __sve128 = _Sve<128>;
>>   using __sve256 = _Sve<256>;
>>   ...etc...
>>
>> ? Code specialised for 128-bit vectors could then use __sve128 and code
>> specialised for 256-bit vectors could use __sve256.
>
> Hmm, as things stand we'd need two numbers, IIUC:
>   _Sve<NumberOfUsedBytes, SizeofRegister>
>
> On x86, "NumberOfUsedBytes" is sufficient, because 33-64 implies zmm
> registers (and -mavx512f), 17-32 implies ymm, and <=16 implies xmm
> (except where it doesn't ;) ).

When would NumberOfUsedBytes < SizeofRegister be used for SVE? Would it
be for storing narrower elements in wider containers? If the interface
supports that then, yeah, two parameters would probably be safer.

Or were you thinking about emulating narrower vectors with wider
registers using predication? I suppose that's possible too, and would be
similar in spirit to using SVE to optimise Advanced SIMD std::simd types.
But mightn't it cause confusion if sizeof applied to a "16-byte" vector
actually gives 32?

>> Perhaps that's not possible as things stand, but it'd be interesting to
>> know the exact failure mode if so. Either way, it would be good to
>> remove direct uses of __ARM_FEATURE_SVE_BITS from simd_sve.h if
>> possible, and instead rely on information derived from template
>> parameters.
>
> The TS spec requires std::experimental::native_simd<T> to basically give
> you the largest, most efficient, full SIMD register. (And it can't be a
> sizeless type because they don't exist in C++.) So how would you do that
> without looking at __ARM_FEATURE_SVE_BITS in the simd implementation?
I assume std::experimental::native_simd<T> has to have the same meaning
everywhere for ODR reasons? If so, it'll need to be an Advanced SIMD
vector for AArch64 (but using SVE to optimise certain operations under
the hood where available). I don't think we could support anything else.
Even if SVE registers are larger than 128 bits, we can't require all code
in the binary to be recompiled with that knowledge.

I suppose that breaks the "largest" requirement, but due to the
streaming/non-streaming distinction I mentioned above, there isn't really
a single, universal "largest" in this context.

>> It should be possible to use SVE to optimise some of the __neon*
>> implementations, which has the advantage of working even for VLA SVE.
>> That's obviously a separate patch, though. Just saying for the record.
>
> I learned that NVidia Grace CPUs alias NEON and SVE registers. But I
> must assume that other SVE implementations (especially those with
> __ARM_FEATURE_SVE_BITS > 128) don't do that and might incur a
> significant latency when going from a NEON register to an SVE register
> and back (which each requires a store-load, IIUC). So are you thinking
> of implementing everything via SVE? That would break ABI, no?

SVE and Advanced SIMD are architected to use the same registers (i.e. SVE
registers architecturally extend Advanced SIMD registers). In Neoverse V1
(SVE256) they are the same physical register as well. I believe the same
is true for A64FX.

FWIW, GCC has already started using SVE in this way. E.g. SVE provides a
wider range of immediate constants for logic operations, so we now use
them for Advanced SIMD logic where beneficial.

Thanks,
Richard