From: Andrew Stubbs
Date: Thu, 15 Feb 2024 10:03:49 +0000
Subject: Re: GCN RDNA2+ vs. GCC vectorizer "Reduce using vector shifts"
To: Richard Biener
Cc: Thomas Schwinge, gcc-patches@gcc.gnu.org
Message-ID: <8d26680d-e206-4d73-9b17-dde12c620e9f@baylibre.com>

On 15/02/2024 07:49, Richard Biener wrote:
> On Wed, 14 Feb 2024, Andrew Stubbs wrote:
>> On 14/02/2024 13:43, Richard Biener wrote:
>>> On Wed, 14 Feb 2024, Andrew Stubbs wrote:
>>>> On 14/02/2024 13:27, Richard Biener wrote:
>>>>> On Wed, 14 Feb 2024, Andrew Stubbs wrote:
>>>>>> On 13/02/2024 08:26, Richard Biener wrote:
>>>>>>> On Mon, 12 Feb 2024, Thomas Schwinge wrote:
>>>>>>>
>>>>>>>> Hi!
>>>>>>>>
>>>>>>>> On 2023-10-20T12:51:03+0100, Andrew Stubbs wrote:
>>>>>>>>> I've committed this patch
>>>>>>>>
>>>>>>>> ... as commit c7ec7bd1c6590cf4eed267feab490288e0b8d691
>>>>>>>> "amdgcn: add -march=gfx1030 EXPERIMENTAL".
>>>>>>>>
>>>>>>>> The RDNA2 ISA variant doesn't support certain instructions
>>>>>>>> previously implemented in GCC/GCN, so a number of patterns etc.
>>>>>>>> had to be disabled:
>>>>>>>>
>>>>>>>>> [...] Vector reductions will need to be reworked for RDNA2. [...]
>>>>>>>>>
>>>>>>>>> 	* config/gcn/gcn-valu.md (@dpp_move): Disable for RDNA2.
>>>>>>>>> 	(addc3): Add RDNA2 syntax variant.
>>>>>>>>> 	(subc3): Likewise.
>>>>>>>>> 	(2_exec): Add RDNA2 alternatives.
>>>>>>>>> 	(vec_cmpdi): Likewise.
>>>>>>>>> 	(vec_cmpdi): Likewise.
>>>>>>>>> 	(vec_cmpdi_exec): Likewise.
>>>>>>>>> 	(vec_cmpdi_exec): Likewise.
>>>>>>>>> 	(vec_cmpdi_dup): Likewise.
>>>>>>>>> 	(vec_cmpdi_dup_exec): Likewise.
>>>>>>>>> 	(reduc__scal_): Disable for RDNA2.
>>>>>>>>> 	(*_dpp_shr_): Likewise.
>>>>>>>>> 	(*plus_carry_dpp_shr_): Likewise.
>>>>>>>>> 	(*plus_carry_in_dpp_shr_): Likewise.
>>>>>>>>
>>>>>>>> Etc.  The expectation being that the GCC middle end copes with
>>>>>>>> this, and synthesizes some less ideal yet still functional vector
>>>>>>>> code, I presume.
>>>>>>>>
>>>>>>>> The later RDNA3/gfx1100 support builds on top of this, and that's
>>>>>>>> what I'm currently working on getting proper GCC/GCN target (not
>>>>>>>> offloading) results for.
>>>>>>>>
>>>>>>>> I'm seeing a good number of execution test FAILs (regressions
>>>>>>>> compared to my earlier non-gfx1100 testing), and I've now tracked
>>>>>>>> down where one large class of those comes into existence -- not
>>>>>>>> yet how to resolve, unfortunately.  But maybe, with you guys'
>>>>>>>> combined vectorizer and back end experience, the latter will be
>>>>>>>> done quickly?
>>>>>>>>
>>>>>>>> Richard, I don't know if you've ever run actual GCC/GCN target
>>>>>>>> (not offloading) testing; let me know if you have any questions
>>>>>>>> about that.
>>>>>>>
>>>>>>> I've only done offload testing - in the x86_64 build tree run
>>>>>>> check-target-libgomp.  If you can tell me how to do GCN target
>>>>>>> testing (maybe document it on the wiki even!) I can try to do that
>>>>>>> as well.
>>>>>>>
>>>>>>>> Given that (at least largely?) the same patterns etc. are disabled
>>>>>>>> as in my gfx1100 configuration, I suppose your gfx1030 one would
>>>>>>>> exhibit the same issues.  You can build GCC/GCN target like you
>>>>>>>> build the offloading one, just remove
>>>>>>>> '--enable-as-accelerator-for=[...]'.  Likely, you can even use an
>>>>>>>> offloading GCC/GCN build to reproduce the issue below.
>>>>>>>>
>>>>>>>> One example is the attached 'builtin-bitops-1.c', reduced from
>>>>>>>> 'gcc.c-torture/execute/builtin-bitops-1.c', where 'my_popcount' is
>>>>>>>> miscompiled as soon as '-ftree-vectorize' is effective:
>>>>>>>>
>>>>>>>>     $ build-gcc/gcc/xgcc -Bbuild-gcc/gcc/ builtin-bitops-1.c \
>>>>>>>>         -Bbuild-gcc/amdgcn-amdhsa/gfx1100/newlib/ \
>>>>>>>>         -Lbuild-gcc/amdgcn-amdhsa/gfx1100/newlib \
>>>>>>>>         -fdump-tree-all-all -fdump-ipa-all-all -fdump-rtl-all-all \
>>>>>>>>         -save-temps -march=gfx1100 -O1 -ftree-vectorize
>>>>>>>>
>>>>>>>> In the 'diff' of 'a-builtin-bitops-1.c.179t.vect', for example,
>>>>>>>> for '-march=gfx90a' vs. '-march=gfx1100', we see:
>>>>>>>>
>>>>>>>>     +builtin-bitops-1.c:7:17: missed:   reduc op not supported by target.
>>>>>>>>
>>>>>>>> ..., and therefore:
>>>>>>>>
>>>>>>>>     -builtin-bitops-1.c:7:17: note:   Reduce using direct vector reduction.
>>>>>>>>     +builtin-bitops-1.c:7:17: note:   Reduce using vector shifts
>>>>>>>>     +builtin-bitops-1.c:7:17: note:   extract scalar result
>>>>>>>>
>>>>>>>> That is, instead of one '.REDUC_PLUS' for gfx90a, for gfx1100 we
>>>>>>>> build a chain of summation of 'VEC_PERM_EXPR's.  However, there's
>>>>>>>> wrong code generated:
>>>>>>>>
>>>>>>>>     $ flock /tmp/gcn.lock build-gcc/gcc/gcn-run a.out
>>>>>>>>     i=1, ints[i]=0x1 a=1, b=2
>>>>>>>>     i=2, ints[i]=0x80000000 a=1, b=2
>>>>>>>>     i=3, ints[i]=0x2 a=1, b=2
>>>>>>>>     i=4, ints[i]=0x40000000 a=1, b=2
>>>>>>>>     i=5, ints[i]=0x10000 a=1, b=2
>>>>>>>>     i=6, ints[i]=0x8000 a=1, b=2
>>>>>>>>     i=7, ints[i]=0xa5a5a5a5 a=16, b=32
>>>>>>>>     i=8, ints[i]=0x5a5a5a5a a=16, b=32
>>>>>>>>     i=9, ints[i]=0xcafe0000 a=11, b=22
>>>>>>>>     i=10, ints[i]=0xcafe00 a=11, b=22
>>>>>>>>     i=11, ints[i]=0xcafe a=11, b=22
>>>>>>>>     i=12, ints[i]=0xffffffff a=32, b=64
>>>>>>>>
>>>>>>>> (I can't tell if the 'b = 2 * a' pattern is purely coincidental?)
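>>>>>>>>
>>>>>>>> (For reference, a sketch of 'my_popcount' -- the attached reduced
>>>>>>>> file is authoritative, but it should essentially be the torture
>>>>>>>> test's bit-testing loop, and it is this loop over 'i' that gets
>>>>>>>> vectorized into the 64-lane masked reduction:)
>>>>>>>>
>>>>>>>>     int
>>>>>>>>     my_popcount (unsigned int x)
>>>>>>>>     {
>>>>>>>>       int i, count = 0;
>>>>>>>>       /* Test each bit of 'x' individually; 'count' is the
>>>>>>>>          reduction variable.  */
>>>>>>>>       for (i = 0; i < 32; i++)
>>>>>>>>         if (x & ((unsigned int) 1 << i))
>>>>>>>>           count++;
>>>>>>>>       return count;
>>>>>>>>     }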
>>>>>>>>
>>>>>>>> I don't speak enough "vectorization" to fully understand the
>>>>>>>> generic vectorized algorithm and its implementation.  It appears
>>>>>>>> that the "Reduce using vector shifts" code has been around for a
>>>>>>>> very long time, but also has gone through a number of changes.  I
>>>>>>>> can't tell which GCC targets/configurations it's actually used for
>>>>>>>> (in the same way as for GCN gfx1100), and thus whether there's an
>>>>>>>> issue in that vectorizer code, or rather in the GCN back end, or in
>>>>>>>> the GCN back end's parameterizing of the generic code?
>>>>>>>
>>>>>>> The "shift" reduction is basically doing reduction by repeatedly
>>>>>>> adding the upper to the lower half of the vector (each time halving
>>>>>>> the vector size).
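>>>>>>>
>>>>>>> In scalar terms the epilogue computes something like this (just a
>>>>>>> sketch of the idea, not the actual implementation):
>>>>>>>
>>>>>>>     /* Sum a 64-element vector by adding the upper half onto the
>>>>>>>        lower half, halving the active size each step; after six
>>>>>>>        steps the full sum is in v[0].  */
>>>>>>>     int
>>>>>>>     reduce_plus (int v[64])
>>>>>>>     {
>>>>>>>       for (int half = 32; half >= 1; half /= 2)
>>>>>>>         for (int i = 0; i < half; i++)
>>>>>>>           v[i] += v[i + half];
>>>>>>>       return v[0];
>>>>>>>     }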
>>>>>>>
>>>>>>>> Manually working through the 'a-builtin-bitops-1.c.265t.optimized'
>>>>>>>> code:
>>>>>>>>
>>>>>>>>     int my_popcount (unsigned int x)
>>>>>>>>     {
>>>>>>>>       int stmp__12.12;
>>>>>>>>       vector(64) int vect__12.11;
>>>>>>>>       vector(64) unsigned int vect__1.8;
>>>>>>>>       vector(64) unsigned int _13;
>>>>>>>>       vector(64) unsigned int vect_cst__18;
>>>>>>>>       vector(64) int [all others];
>>>>>>>>
>>>>>>>>       <bb 2> [local count: 32534376]:
>>>>>>>>       vect_cst__18 = { [all 'x_8(D)'] };
>>>>>>>>       vect__1.8_19 = vect_cst__18 >> { 0, 1, 2, [...], 61, 62, 63 };
>>>>>>>>       _13 = .COND_AND ({ [32 x '-1'], [32 x '0'] }, vect__1.8_19, { [all '1'] }, { [all '0'] });
>>>>>>>>       vect__12.11_24 = VIEW_CONVERT_EXPR<vector(64) int>(_13);
>>>>>>>>       _26 = VEC_PERM_EXPR <vect__12.11_24, { [all '0'] }, { 32, 33, 34, [...], 93, 94, 95 }>;
>>>>>>>>       _27 = vect__12.11_24 + _26;
>>>>>>>>       _28 = VEC_PERM_EXPR <_27, { [all '0'] }, { 16, 17, 18, [...], 77, 78, 79 }>;
>>>>>>>>       _29 = _27 + _28;
>>>>>>>>       _30 = VEC_PERM_EXPR <_29, { [all '0'] }, { 8, 9, 10, [...], 69, 70, 71 }>;
>>>>>>>>       _31 = _29 + _30;
>>>>>>>>       _32 = VEC_PERM_EXPR <_31, { [all '0'] }, { 4, 5, 6, [...], 65, 66, 67 }>;
>>>>>>>>       _33 = _31 + _32;
>>>>>>>>       _34 = VEC_PERM_EXPR <_33, { [all '0'] }, { 2, 3, 4, [...], 63, 64, 65 }>;
>>>>>>>>       _35 = _33 + _34;
>>>>>>>>       _36 = VEC_PERM_EXPR <_35, { [all '0'] }, { 1, 2, 3, [...], 62, 63, 64 }>;
>>>>>>>>       _37 = _35 + _36;
>>>>>>>>       stmp__12.12_38 = BIT_FIELD_REF <_37, 32, 0>;
>>>>>>>>       return stmp__12.12_38;
>>>>>>>>     }
>>>>>>>>
>>>>>>>> ..., for example, for 'x = 7', we get:
>>>>>>>>
>>>>>>>>     vect_cst__18 = { [all '7'] };
>>>>>>>>     vect__1.8_19 = { 7, 3, 1, 0, 0, 0, [...] };
>>>>>>>>     _13 = { 1, 1, 1, 0, 0, 0, [...] };
>>>>>>>>     vect__12.11_24 = { 1, 1, 1, 0, 0, 0, [...] };
>>>>>>>>     _26 = { [all '0'] };
>>>>>>>>     _27 = { 1, 1, 1, 0, 0, 0, [...] };
>>>>>>>>     _28 = { [all '0'] };
>>>>>>>>     _29 = { 1, 1, 1, 0, 0, 0, [...] };
>>>>>>>>     _30 = { [all '0'] };
>>>>>>>>     _31 = { 1, 1, 1, 0, 0, 0, [...] };
>>>>>>>>     _32 = { [all '0'] };
>>>>>>>>     _33 = { 1, 1, 1, 0, 0, 0, [...] };
>>>>>>>>     _34 = { 1, 0, 0, 0, [...] };
>>>>>>>>     _35 = { 2, 1, 1, 0, 0, 0, [...] };
>>>>>>>>     _36 = { 1, 1, 0, 0, 0, [...] };
>>>>>>>>     _37 = { 3, 2, 1, 0, 0, 0, [...] };
>>>>>>>>     stmp__12.12_38 = 3;
>>>>>>>>     return 3;
>>>>>>>>
>>>>>>>> ..., so the algorithm would appear to synthesize correct code for
>>>>>>>> that case.  Adding '7' to 'builtin-bitops-1.c', we however again
>>>>>>>> get:
>>>>>>>>
>>>>>>>>     i=13, ints[i]=0x7 a=3, b=6
>>>>>>>>
>>>>>>>> With the following hack applied to 'gcc/tree-vect-loop.cc':
>>>>>>>>
>>>>>>>>     @@ -6687,8 +6687,9 @@ vect_create_epilog_for_reduction (loop_vec_info loop_vinfo,
>>>>>>>>            reduce_with_shift = have_whole_vector_shift (mode1);
>>>>>>>>            if (!VECTOR_MODE_P (mode1)
>>>>>>>>                || !directly_supported_p (code, vectype1))
>>>>>>>>              reduce_with_shift = false;
>>>>>>>>     +      reduce_with_shift = false;
>>>>>>>>
>>>>>>>> ..., I'm able to work around those regressions: by means of
>>>>>>>> forcing "Reduce using scalar code" instead of "Reduce using vector
>>>>>>>> shifts".
>>>>>>>
>>>>>>> I would say it somewhere gets broken between the vectorizer and the
>>>>>>> GPU, which means likely in the target?  Can you point out an issue
>>>>>>> in the actual generated GCN code?
>>>>>>>
>>>>>>> Iff this kind of reduction is the issue you'd see quite a lot of
>>>>>>> vectorizer execute FAILs.  I'm seeing a .COND_AND above - could it
>>>>>>> be that the "mask" is still set wrong when doing the reduction
>>>>>>> steps?
>>>>>>
>>>>>> It looks like the ds_bpermute_b32 instruction works differently on
>>>>>> RDNA3 (vs. GCN/CDNA and even RDNA2).
>>>>>>
>>>>>> From the pseudocode in the documentation:
>>>>>>
>>>>>>     for i in 0 : WAVE64 ? 63 : 31 do
>>>>>>       // ADDR needs to be divided by 4.
>>>>>>       // High-order bits are ignored.
>>>>>>       // NOTE: destination lane is MOD 32 regardless of wave size.
>>>>>>       src_lane = 32'I(VGPR[i][ADDR] + OFFSET.b) / 4 % 32;
>>>>>>       // EXEC is applied to the source VGPR reads.
>>>>>>       if EXEC[src_lane].u1 then
>>>>>>         tmp[i] = VGPR[src_lane][DATA0]
>>>>>>       endif
>>>>>>     endfor;
>>>>>>
>>>>>> The key detail is the "mod 32"; the other architectures have
>>>>>> "mod 64" there.
>>>>>>
>>>>>> So, the last 32 lanes are discarded, the first 32 lanes are
>>>>>> duplicated into the last 32, and this explains why my_popcount
>>>>>> returns double the expected value for smaller inputs.
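>>>>>>
>>>>>> (A plain-C model of the effect, assuming the whole-vector shift is
>>>>>> lowered to ds_bpermute reads of lane 'i + shift'; only lane 0's
>>>>>> final value matters here.  'mod' is 32 for the RDNA3 behaviour, 64
>>>>>> for GCN/CDNA:)
>>>>>>
>>>>>>     /* Six halving steps of the "reduce using vector shifts"
>>>>>>        epilogue; each lane reads source lane (i + shift) % mod,
>>>>>>        and lanes shifted in from beyond the vector read zero.  */
>>>>>>     static int reduce (const int *v, int mod)
>>>>>>     {
>>>>>>       int cur[64], nxt[64];
>>>>>>       __builtin_memcpy (cur, v, sizeof cur);
>>>>>>       for (int shift = 32; shift >= 1; shift /= 2)
>>>>>>         {
>>>>>>           for (int i = 0; i < 64; i++)
>>>>>>             nxt[i] = cur[i] + (i + shift < 64 ? cur[(i + shift) % mod] : 0);
>>>>>>           __builtin_memcpy (cur, nxt, sizeof nxt);
>>>>>>         }
>>>>>>       return cur[0];
>>>>>>     }
>>>>>>
>>>>>>     /* With x = 7 the live lanes are { 1, 1, 1, 0, ... }:
>>>>>>        reduce (v, 64) gives 3, but reduce (v, 32) gives 6, because
>>>>>>        the very first step reads the *low* half again instead of
>>>>>>        the upper half, doubling every low lane.  */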
>>>>>>
>>>>>> Richi, can you confirm that this testcase works properly on your
>>>>>> card, please?
>>>>>>
>>>>>> To test, assuming you only have the offload toolchain built, compile
>>>>>> using x86_64-none-linux-gnu-accel-amdgcn-amdhsa-gcc, which should
>>>>>> produce a raw AMD ELF file.  Then you run it using "gcn-run a.out"
>>>>>> (you can find gcn-run under libexec).
>>>>>
>>>>> I'm getting
>>>>>
>>>>>     i=1, ints[i]=0x1 a=1, b=2
>>>>>     i=2, ints[i]=0x80000000 a=1, b=2
>>>>>     i=3, ints[i]=0x2 a=1, b=2
>>>>>     i=4, ints[i]=0x40000000 a=1, b=2
>>>>>     i=5, ints[i]=0x10000 a=1, b=2
>>>>>     i=6, ints[i]=0x8000 a=1, b=2
>>>>>     i=7, ints[i]=0xa5a5a5a5 a=16, b=32
>>>>>     i=8, ints[i]=0x5a5a5a5a a=16, b=32
>>>>>     i=9, ints[i]=0xcafe0000 a=11, b=22
>>>>>     i=10, ints[i]=0xcafe00 a=11, b=22
>>>>>     i=11, ints[i]=0xcafe a=11, b=22
>>>>>     i=12, ints[i]=0xffffffff a=32, b=64
>>>>>
>>>>> which I think is the same as Thomas' output and thus wrong?
>>>>>
>>>>> When building with -O0 I get no output.
>>>>>
>>>>> I'm of course building with -march=gfx1030
>>>>
>>>> OK, please try this example, just to check my expectation that your
>>>> permute works:
>>>>
>>>>     typedef int v64si __attribute__ ((vector_size (256)));
>>>>
>>>>     int main()
>>>>     {
>>>>       v64si permute = {
>>>>         40, 40, 40, 40, 40, 40, 40, 40,
>>>>         40, 40, 40, 40, 40, 40, 40, 40,
>>>>         40, 40, 40, 40, 40, 40, 40, 40,
>>>>         40, 40, 40, 40, 40, 40, 40, 40,
>>>>         40, 40, 40, 40, 40, 40, 40, 40,
>>>>         40, 40, 40, 40, 40, 40, 40, 40,
>>>>         40, 40, 40, 40, 40, 40, 40, 40,
>>>>         40, 40, 40, 40, 40, 40, 40, 40
>>>>       };
>>>>       v64si result;
>>>>
>>>>       asm ("ds_bpermute_b32 %0, %1, v1"
>>>>            : "=v"(result) : "v"(permute), "e"(-1L));
>>>>
>>>>       for (int i=0; i<63; i++)
>>>>         __builtin_printf ("%d ", result[i]);
>>>>       __builtin_printf ("\n");
>>>>
>>>>       return 0;
>>>>     }
>>>>
>>>> On GCN/CDNA devices I expect this to print "10" 64 times.  On RDNA3 it
>>>> prints "10" 32 times, and "42" 32 times (which doesn't quite match
>>>> what I'd expect from the pseudocode, but does match the written
>>>> description).  Which do you get?
>>>
>>> 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
>>> 10 10 10 10 10 10 10 10 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42
>>> 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42
>>>
>>> so RDNA2 matches RDNA3 here.
>>
>> OK, that probably is the problem with both our reductions then.  The
>> RDNA2 manual has the 32-lane wording in the description, but the
>> instruction pseudocode lies. :(
>>
>> I'm now not sure how to implement permute without actually hitting
>> memory?  The permutation vector is exactly what we'd need to do a
>> gather load from memory (not a coincidence), but we'd need to find a
>> memory location to do it, ideally in the low-latency LDS memory, and
>> it'd have to be thread-safe.
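>>
>> (In scalar terms the equivalence is trivial -- a permute is just a
>> gather whose addresses are lane indices, so the same index vector
>> would drive a gather load from a scratch copy of the data:)
>>
>>     /* ds_bpermute as a gather: ADDR is a byte address, hence the
>>        division by 4 (cf. the pseudocode above).  */
>>     void permute_via_gather (int *result, const int *data,
>>                              const int *permute)
>>     {
>>       for (int i = 0; i < 64; i++)
>>         result[i] = data[permute[i] / 4];
>>     }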
>>
>> The attached not-well-tested patch should allow only valid
>> permutations.  Hopefully we go back to working code, but there'll be
>> things that won't vectorize.  That said, the new "dump" output code has
>> fewer and probably cheaper instructions, so hmmm.
>
> This fixes the reduced builtin-bitops-1.c on RDNA2.
>
> I suppose if RDNA really only has 32-lane vectors (it sounds like it,
> even if it can "simulate" 64-lane ones?) then it might make sense to
> vectorize for 32 lanes?  That said, with variable-length it likely
> doesn't matter, but I'd not expose fixed-size modes with 64 lanes then?

For most operations, wavefrontsize=64 works just fine; the GPU runs each
instruction twice and presents a pair of hardware registers as a logical
64-lane register.  This breaks down for permutations and reductions, and
is obviously inefficient when the vectors are not fully utilized, but is
otherwise compatible with the GCN/CDNA compiler.

I didn't want to invest all the effort it would take to support
wavefrontsize=32, which would be the natural mode for these devices; the
number of places that have "64" hard-coded is just too big.  Not only
that, but the EXEC and VCC registers change from DImode to SImode and
that's going to break a lot of stuff.  (And we have no paying customer
for this.)

I'm open to patch submissions. :)

Andrew