From: Andrew Stubbs
Date: Thu, 15 Feb 2024 10:03:49 +0000
Subject: Re: GCN RDNA2+ vs. GCC vectorizer "Reduce using vector shifts"
To: Richard Biener
Cc: Thomas Schwinge, gcc-patches@gcc.gnu.org
Message-ID: <8d26680d-e206-4d73-9b17-dde12c620e9f@baylibre.com>

On 15/02/2024 07:49, Richard Biener wrote:
> On Wed, 14 Feb 2024, Andrew Stubbs wrote:
>> On 14/02/2024 13:43, Richard Biener wrote:
>>> On Wed, 14 Feb 2024, Andrew Stubbs wrote:
>>>> On 14/02/2024 13:27, Richard Biener wrote:
>>>>> On Wed, 14 Feb 2024, Andrew Stubbs wrote:
>>>>>> On 13/02/2024 08:26, Richard Biener wrote:
>>>>>>> On Mon, 12 Feb 2024, Thomas Schwinge wrote:
>>>>>>>
>>>>>>>> Hi!
>>>>>>>>
>>>>>>>> On 2023-10-20T12:51:03+0100, Andrew Stubbs wrote:
>>>>>>>>> I've committed this patch
>>>>>>>>
>>>>>>>> ... as commit c7ec7bd1c6590cf4eed267feab490288e0b8d691
>>>>>>>> "amdgcn: add -march=gfx1030 EXPERIMENTAL".
>>>>>>>>
>>>>>>>> The RDNA2 ISA variant doesn't support certain instructions
>>>>>>>> previously implemented in GCC/GCN, so a number of patterns etc.
>>>>>>>> had to be disabled:
>>>>>>>>
>>>>>>>>> [...] Vector reductions will need to be reworked for RDNA2. [...]
>>>>>>>>>
>>>>>>>>> 	* config/gcn/gcn-valu.md (@dpp_move): Disable for RDNA2.
>>>>>>>>> 	(addc3): Add RDNA2 syntax variant.
>>>>>>>>> 	(subc3): Likewise.
>>>>>>>>> 	(2_exec): Add RDNA2 alternatives.
>>>>>>>>> 	(vec_cmpdi): Likewise.
>>>>>>>>> 	(vec_cmpdi): Likewise.
>>>>>>>>> 	(vec_cmpdi_exec): Likewise.
>>>>>>>>> 	(vec_cmpdi_exec): Likewise.
>>>>>>>>> 	(vec_cmpdi_dup): Likewise.
>>>>>>>>> 	(vec_cmpdi_dup_exec): Likewise.
>>>>>>>>> 	(reduc__scal_): Disable for RDNA2.
>>>>>>>>> 	(*_dpp_shr_): Likewise.
>>>>>>>>> 	(*plus_carry_dpp_shr_): Likewise.
>>>>>>>>> 	(*plus_carry_in_dpp_shr_): Likewise.
>>>>>>>>
>>>>>>>> Etc.  The expectation being that the GCC middle end copes with
>>>>>>>> this, and synthesizes some less ideal yet still functional vector
>>>>>>>> code, I presume.
>>>>>>>>
>>>>>>>> The later RDNA3/gfx1100 support builds on top of this, and that's
>>>>>>>> what I'm currently working on getting proper GCC/GCN target (not
>>>>>>>> offloading) results for.
>>>>>>>>
>>>>>>>> I'm seeing a good number of execution test FAILs (regressions
>>>>>>>> compared to my earlier non-gfx1100 testing), and I've now tracked
>>>>>>>> down where one large class of those comes into existence -- not
>>>>>>>> yet how to resolve, unfortunately.  But maybe, with you guys'
>>>>>>>> combined vectorizer and back end experience, the latter will be
>>>>>>>> done quickly?
>>>>>>>>
>>>>>>>> Richard, I don't know if you've ever run actual GCC/GCN target
>>>>>>>> (not offloading) testing; let me know if you have any questions
>>>>>>>> about that.
>>>>>>>
>>>>>>> I've only done offload testing - in the x86_64 build tree run
>>>>>>> check-target-libgomp.  If you can tell me how to do GCN target
>>>>>>> testing (maybe document it on the wiki even!) I can try to do that
>>>>>>> as well.
>>>>>>>
>>>>>>>> Given that (at least largely?) the same patterns etc. are disabled
>>>>>>>> as in my gfx1100 configuration, I suppose your gfx1030 one would
>>>>>>>> exhibit the same issues.  You can build GCC/GCN target like you
>>>>>>>> build the offloading one, just remove
>>>>>>>> '--enable-as-accelerator-for=[...]'.  Likely, you can even use an
>>>>>>>> offloading GCC/GCN build to reproduce the issue below.
>>>>>>>>
>>>>>>>> One example is the attached 'builtin-bitops-1.c', reduced from
>>>>>>>> 'gcc.c-torture/execute/builtin-bitops-1.c', where 'my_popcount' is
>>>>>>>> miscompiled as soon as '-ftree-vectorize' is effective:
>>>>>>>>
>>>>>>>>     $ build-gcc/gcc/xgcc -Bbuild-gcc/gcc/ builtin-bitops-1.c \
>>>>>>>>         -Bbuild-gcc/amdgcn-amdhsa/gfx1100/newlib/ \
>>>>>>>>         -Lbuild-gcc/amdgcn-amdhsa/gfx1100/newlib \
>>>>>>>>         -fdump-tree-all-all -fdump-ipa-all-all -fdump-rtl-all-all \
>>>>>>>>         -save-temps -march=gfx1100 -O1 -ftree-vectorize
>>>>>>>>
>>>>>>>> In the 'diff' of 'a-builtin-bitops-1.c.179t.vect', for example,
>>>>>>>> for '-march=gfx90a' vs. '-march=gfx1100', we see:
>>>>>>>>
>>>>>>>>     +builtin-bitops-1.c:7:17: missed:   reduc op not supported by target.
>>>>>>>>
>>>>>>>> ..., and therefore:
>>>>>>>>
>>>>>>>>     -builtin-bitops-1.c:7:17: note:   Reduce using direct vector reduction.
>>>>>>>>     +builtin-bitops-1.c:7:17: note:   Reduce using vector shifts
>>>>>>>>     +builtin-bitops-1.c:7:17: note:   extract scalar result
>>>>>>>>
>>>>>>>> That is, instead of one '.REDUC_PLUS' for gfx90a, for gfx1100 we
>>>>>>>> build a chain of summation of 'VEC_PERM_EXPR's.  However, there's
>>>>>>>> wrong code generated:
>>>>>>>>
>>>>>>>>     $ flock /tmp/gcn.lock build-gcc/gcc/gcn-run a.out
>>>>>>>>     i=1, ints[i]=0x1 a=1, b=2
>>>>>>>>     i=2, ints[i]=0x80000000 a=1, b=2
>>>>>>>>     i=3, ints[i]=0x2 a=1, b=2
>>>>>>>>     i=4, ints[i]=0x40000000 a=1, b=2
>>>>>>>>     i=5, ints[i]=0x10000 a=1, b=2
>>>>>>>>     i=6, ints[i]=0x8000 a=1, b=2
>>>>>>>>     i=7, ints[i]=0xa5a5a5a5 a=16, b=32
>>>>>>>>     i=8, ints[i]=0x5a5a5a5a a=16, b=32
>>>>>>>>     i=9, ints[i]=0xcafe0000 a=11, b=22
>>>>>>>>     i=10, ints[i]=0xcafe00 a=11, b=22
>>>>>>>>     i=11, ints[i]=0xcafe a=11, b=22
>>>>>>>>     i=12, ints[i]=0xffffffff a=32, b=64
>>>>>>>>
>>>>>>>> (I can't tell if the 'b = 2 * a' pattern is purely coincidental?)
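>>>>>>>>
>>>>>>>> (For reference, a sketch of 'my_popcount' -- the attached reduced
>>>>>>>> file is authoritative, but it should essentially be the torture
>>>>>>>> test's bit-testing loop, and it is this loop over 'i' that gets
>>>>>>>> vectorized into the 64-lane masked reduction:)
>>>>>>>>
>>>>>>>>     int
>>>>>>>>     my_popcount (unsigned int x)
>>>>>>>>     {
>>>>>>>>       int i, count = 0;
>>>>>>>>       /* Test each bit of 'x' individually; 'count' is the
>>>>>>>>          reduction variable.  */
>>>>>>>>       for (i = 0; i < 32; i++)
>>>>>>>>         if (x & ((unsigned int) 1 << i))
>>>>>>>>           count++;
>>>>>>>>       return count;
>>>>>>>>     }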
>>>>>>>>
>>>>>>>> I don't speak enough "vectorization" to fully understand the
>>>>>>>> generic vectorized algorithm and its implementation.  It appears
>>>>>>>> that the "Reduce using vector shifts" code has been around for a
>>>>>>>> very long time, but also has gone through a number of changes.  I
>>>>>>>> can't tell which GCC targets/configurations it's actually used for
>>>>>>>> (in the same way as for GCN gfx1100), and thus whether there's an
>>>>>>>> issue in that vectorizer code, or rather in the GCN back end, or in
>>>>>>>> the GCN back end's parameterizing of the generic code?
>>>>>>>
>>>>>>> The "shift" reduction is basically doing reduction by repeatedly
>>>>>>> adding the upper to the lower half of the vector (each time halving
>>>>>>> the vector size).
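>>>>>>>
>>>>>>> In scalar terms the epilogue computes something like this (just a
>>>>>>> sketch of the idea, not the actual implementation):
>>>>>>>
>>>>>>>     /* Sum a 64-element vector by adding the upper half onto the
>>>>>>>        lower half, halving the active size each step; after six
>>>>>>>        steps the full sum is in v[0].  */
>>>>>>>     int
>>>>>>>     reduce_plus (int v[64])
>>>>>>>     {
>>>>>>>       for (int half = 32; half >= 1; half /= 2)
>>>>>>>         for (int i = 0; i < half; i++)
>>>>>>>           v[i] += v[i + half];
>>>>>>>       return v[0];
>>>>>>>     }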
>>>>>>>
>>>>>>>> Manually working through the 'a-builtin-bitops-1.c.265t.optimized'
>>>>>>>> code:
>>>>>>>>
>>>>>>>>     int my_popcount (unsigned int x)
>>>>>>>>     {
>>>>>>>>       int stmp__12.12;
>>>>>>>>       vector(64) int vect__12.11;
>>>>>>>>       vector(64) unsigned int vect__1.8;
>>>>>>>>       vector(64) unsigned int _13;
>>>>>>>>       vector(64) unsigned int vect_cst__18;
>>>>>>>>       vector(64) int [all others];
>>>>>>>>
>>>>>>>>       <bb 2> [local count: 32534376]:
>>>>>>>>       vect_cst__18 = { [all 'x_8(D)'] };
>>>>>>>>       vect__1.8_19 = vect_cst__18 >> { 0, 1, 2, [...], 61, 62, 63 };
>>>>>>>>       _13 = .COND_AND ({ [32 x '-1'], [32 x '0'] }, vect__1.8_19, { [all '1'] }, { [all '0'] });
>>>>>>>>       vect__12.11_24 = VIEW_CONVERT_EXPR<vector(64) int>(_13);
>>>>>>>>       _26 = VEC_PERM_EXPR <vect__12.11_24, { [all '0'] }, { 32, 33, 34, [...], 93, 94, 95 }>;
>>>>>>>>       _27 = vect__12.11_24 + _26;
>>>>>>>>       _28 = VEC_PERM_EXPR <_27, { [all '0'] }, { 16, 17, 18, [...], 77, 78, 79 }>;
>>>>>>>>       _29 = _27 + _28;
>>>>>>>>       _30 = VEC_PERM_EXPR <_29, { [all '0'] }, { 8, 9, 10, [...], 69, 70, 71 }>;
>>>>>>>>       _31 = _29 + _30;
>>>>>>>>       _32 = VEC_PERM_EXPR <_31, { [all '0'] }, { 4, 5, 6, [...], 65, 66, 67 }>;
>>>>>>>>       _33 = _31 + _32;
>>>>>>>>       _34 = VEC_PERM_EXPR <_33, { [all '0'] }, { 2, 3, 4, [...], 63, 64, 65 }>;
>>>>>>>>       _35 = _33 + _34;
>>>>>>>>       _36 = VEC_PERM_EXPR <_35, { [all '0'] }, { 1, 2, 3, [...], 62, 63, 64 }>;
>>>>>>>>       _37 = _35 + _36;
>>>>>>>>       stmp__12.12_38 = BIT_FIELD_REF <_37, 32, 0>;
>>>>>>>>       return stmp__12.12_38;
>>>>>>>>     }
>>>>>>>>
>>>>>>>> ..., for example, for 'x = 7', we get:
>>>>>>>>
>>>>>>>>     vect_cst__18 = { [all '7'] };
>>>>>>>>     vect__1.8_19 = { 7, 3, 1, 0, 0, 0, [...] };
>>>>>>>>     _13 = { 1, 1, 1, 0, 0, 0, [...] };
>>>>>>>>     vect__12.11_24 = { 1, 1, 1, 0, 0, 0, [...] };
>>>>>>>>     _26 = { [all '0'] };
>>>>>>>>     _27 = { 1, 1, 1, 0, 0, 0, [...] };
>>>>>>>>     _28 = { [all '0'] };
>>>>>>>>     _29 = { 1, 1, 1, 0, 0, 0, [...] };
>>>>>>>>     _30 = { [all '0'] };
>>>>>>>>     _31 = { 1, 1, 1, 0, 0, 0, [...] };
>>>>>>>>     _32 = { [all '0'] };
>>>>>>>>     _33 = { 1, 1, 1, 0, 0, 0, [...] };
>>>>>>>>     _34 = { 1, 0, 0, 0, [...] };
>>>>>>>>     _35 = { 2, 1, 1, 0, 0, 0, [...] };
>>>>>>>>     _36 = { 1, 1, 0, 0, 0, [...] };
>>>>>>>>     _37 = { 3, 2, 1, 0, 0, 0, [...] };
>>>>>>>>     stmp__12.12_38 = 3;
>>>>>>>>     return 3;
>>>>>>>>
>>>>>>>> ..., so the algorithm would appear to synthesize correct code for
>>>>>>>> that case.  Adding '7' to 'builtin-bitops-1.c', we however again
>>>>>>>> get:
>>>>>>>>
>>>>>>>>     i=13, ints[i]=0x7 a=3, b=6
>>>>>>>>
>>>>>>>> With the following hack applied to 'gcc/tree-vect-loop.cc':
>>>>>>>>
>>>>>>>>     @@ -6687,8 +6687,9 @@ vect_create_epilog_for_reduction (loop_vec_info loop_vinfo,
>>>>>>>>            reduce_with_shift = have_whole_vector_shift (mode1);
>>>>>>>>            if (!VECTOR_MODE_P (mode1)
>>>>>>>>                || !directly_supported_p (code, vectype1))
>>>>>>>>              reduce_with_shift = false;
>>>>>>>>     +      reduce_with_shift = false;
>>>>>>>>
>>>>>>>> ..., I'm able to work around those regressions: by means of
>>>>>>>> forcing "Reduce using scalar code" instead of "Reduce using vector
>>>>>>>> shifts".
>>>>>>>
>>>>>>> I would say it somewhere gets broken between the vectorizer and the
>>>>>>> GPU, which means likely in the target?  Can you point out an issue
>>>>>>> in the actual generated GCN code?
>>>>>>>
>>>>>>> Iff this kind of reduction is the issue you'd see quite a lot of
>>>>>>> vectorizer execute FAILs.  I'm seeing a .COND_AND above - could it
>>>>>>> be that the "mask" is still set wrong when doing the reduction
>>>>>>> steps?
>>>>>>
>>>>>> It looks like the ds_bpermute_b32 instruction works differently on
>>>>>> RDNA3 (vs. GCN/CDNA and even RDNA2).
>>>>>>
>>>>>> From the pseudocode in the documentation:
>>>>>>
>>>>>>     for i in 0 : WAVE64 ? 63 : 31 do
>>>>>>       // ADDR needs to be divided by 4.
>>>>>>       // High-order bits are ignored.
>>>>>>       // NOTE: destination lane is MOD 32 regardless of wave size.
>>>>>>       src_lane = 32'I(VGPR[i][ADDR] + OFFSET.b) / 4 % 32;
>>>>>>       // EXEC is applied to the source VGPR reads.
>>>>>>       if EXEC[src_lane].u1 then
>>>>>>         tmp[i] = VGPR[src_lane][DATA0]
>>>>>>       endif
>>>>>>     endfor;
>>>>>>
>>>>>> The key detail is the "mod 32"; the other architectures have
>>>>>> "mod 64" there.
>>>>>>
>>>>>> So, the last 32 lanes are discarded, the first 32 lanes are
>>>>>> duplicated into the last 32, and this explains why my_popcount
>>>>>> returns double the expected value for smaller inputs.
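>>>>>>
>>>>>> (A plain-C model of the effect, assuming the whole-vector shift is
>>>>>> lowered to ds_bpermute reads of lane 'i + shift'; only lane 0's
>>>>>> final value matters here.  'mod' is 32 for the RDNA3 behaviour, 64
>>>>>> for GCN/CDNA:)
>>>>>>
>>>>>>     /* Six halving steps of the "reduce using vector shifts"
>>>>>>        epilogue; each lane reads source lane (i + shift) % mod,
>>>>>>        and lanes shifted in from beyond the vector read zero.  */
>>>>>>     static int reduce (const int *v, int mod)
>>>>>>     {
>>>>>>       int cur[64], nxt[64];
>>>>>>       __builtin_memcpy (cur, v, sizeof cur);
>>>>>>       for (int shift = 32; shift >= 1; shift /= 2)
>>>>>>         {
>>>>>>           for (int i = 0; i < 64; i++)
>>>>>>             nxt[i] = cur[i] + (i + shift < 64 ? cur[(i + shift) % mod] : 0);
>>>>>>           __builtin_memcpy (cur, nxt, sizeof nxt);
>>>>>>         }
>>>>>>       return cur[0];
>>>>>>     }
>>>>>>
>>>>>>     /* With x = 7 the live lanes are { 1, 1, 1, 0, ... }:
>>>>>>        reduce (v, 64) gives 3, but reduce (v, 32) gives 6, because
>>>>>>        the very first step reads the *low* half again instead of
>>>>>>        the upper half, doubling every low lane.  */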
>>>>>>
>>>>>> Richi, can you confirm that this testcase works properly on your
>>>>>> card, please?
>>>>>>
>>>>>> To test, assuming you only have the offload toolchain built, compile
>>>>>> using x86_64-none-linux-gnu-accel-amdgcn-amdhsa-gcc, which should
>>>>>> produce a raw AMD ELF file.  Then you run it using "gcn-run a.out"
>>>>>> (you can find gcn-run under libexec).
>>>>>
>>>>> I'm getting
>>>>>
>>>>>     i=1, ints[i]=0x1 a=1, b=2
>>>>>     i=2, ints[i]=0x80000000 a=1, b=2
>>>>>     i=3, ints[i]=0x2 a=1, b=2
>>>>>     i=4, ints[i]=0x40000000 a=1, b=2
>>>>>     i=5, ints[i]=0x10000 a=1, b=2
>>>>>     i=6, ints[i]=0x8000 a=1, b=2
>>>>>     i=7, ints[i]=0xa5a5a5a5 a=16, b=32
>>>>>     i=8, ints[i]=0x5a5a5a5a a=16, b=32
>>>>>     i=9, ints[i]=0xcafe0000 a=11, b=22
>>>>>     i=10, ints[i]=0xcafe00 a=11, b=22
>>>>>     i=11, ints[i]=0xcafe a=11, b=22
>>>>>     i=12, ints[i]=0xffffffff a=32, b=64
>>>>>
>>>>> which I think is the same as Thomas' output and thus wrong?
>>>>>
>>>>> When building with -O0 I get no output.
>>>>>
>>>>> I'm of course building with -march=gfx1030
>>>>
>>>> OK, please try this example, just to check my expectation that your
>>>> permute works:
>>>>
>>>>     typedef int v64si __attribute__ ((vector_size (256)));
>>>>
>>>>     int main()
>>>>     {
>>>>       v64si permute = {
>>>>         40, 40, 40, 40, 40, 40, 40, 40,
>>>>         40, 40, 40, 40, 40, 40, 40, 40,
>>>>         40, 40, 40, 40, 40, 40, 40, 40,
>>>>         40, 40, 40, 40, 40, 40, 40, 40,
>>>>         40, 40, 40, 40, 40, 40, 40, 40,
>>>>         40, 40, 40, 40, 40, 40, 40, 40,
>>>>         40, 40, 40, 40, 40, 40, 40, 40,
>>>>         40, 40, 40, 40, 40, 40, 40, 40
>>>>       };
>>>>       v64si result;
>>>>
>>>>       asm ("ds_bpermute_b32 %0, %1, v1"
>>>>            : "=v"(result) : "v"(permute), "e"(-1L));
>>>>
>>>>       for (int i=0; i<63; i++)
>>>>         __builtin_printf ("%d ", result[i]);
>>>>       __builtin_printf ("\n");
>>>>
>>>>       return 0;
>>>>     }
>>>>
>>>> On GCN/CDNA devices I expect this to print "10" 64 times.  On RDNA3 it
>>>> prints "10" 32 times, and "42" 32 times (which doesn't quite match
>>>> what I'd expect from the pseudocode, but does match the written
>>>> description).  Which do you get?
>>>
>>> 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
>>> 10 10 10 10 10 10 10 10 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42
>>> 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42
>>>
>>> so RDNA2 matches RDNA3 here.
>>
>> OK, that probably is the problem with both our reductions then.  The
>> RDNA2 manual has the 32-lane wording in the description, but the
>> instruction pseudocode lies. :(
>>
>> I'm now not sure how to implement permute without actually hitting
>> memory?  The permutation vector is exactly what we'd need to do a
>> gather load from memory (not a coincidence), but we'd need to find a
>> memory location to do it, ideally in the low-latency LDS memory, and
>> it'd have to be thread-safe.
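>>
>> (In scalar terms the equivalence is trivial -- a permute is just a
>> gather whose addresses are lane indices, so the same index vector
>> would drive a gather load from a scratch copy of the data:)
>>
>>     /* ds_bpermute as a gather: ADDR is a byte address, hence the
>>        division by 4 (cf. the pseudocode above).  */
>>     void permute_via_gather (int *result, const int *data,
>>                              const int *permute)
>>     {
>>       for (int i = 0; i < 64; i++)
>>         result[i] = data[permute[i] / 4];
>>     }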
>>
>> The attached not-well-tested patch should allow only valid
>> permutations.  Hopefully we go back to working code, but there'll be
>> things that won't vectorize.  That said, the new "dump" output code has
>> fewer and probably cheaper instructions, so hmmm.
>
> This fixes the reduced builtin-bitops-1.c on RDNA2.
>
> I suppose if RDNA really only has 32-lane vectors (it sounds like it,
> even if it can "simulate" 64-lane ones?) then it might make sense to
> vectorize for 32 lanes?  That said, with variable-length it likely
> doesn't matter, but I'd not expose fixed-size modes with 64 lanes then?

For most operations, wavefrontsize=64 works just fine; the GPU runs each
instruction twice and presents a pair of hardware registers as a logical
64-lane register.  This breaks down for permutations and reductions, and
is obviously inefficient when the vectors are not fully utilized, but is
otherwise compatible with the GCN/CDNA compiler.

I didn't want to invest all the effort it would take to support
wavefrontsize=32, which would be the natural mode for these devices; the
number of places that have "64" hard-coded is just too big.  Not only
that, but the EXEC and VCC registers change from DImode to SImode and
that's going to break a lot of stuff.  (And we have no paying customer
for this.)

I'm open to patch submissions. :)

Andrew