From: "Kewen.Lin"
Date: Tue, 16 Aug 2022 14:53:15 +0800
Subject: Re: [PATCH v2] rs6000: Fix incorrect RTL for Power LE when removing the UNSPECS [PR106069]
To: Xionghu Luo
Cc: Xionghu Luo, gcc-patches@gcc.gnu.org, David Edelsohn, Segher Boessenkool
Message-ID: <5df1a7fc-dacf-72e2-041d-66624926091f@linux.ibm.com>
In-Reply-To: <472c1531-aae6-123e-6b0c-8827f5585879@gmail.com>
References: <20220808034247.2618809-1-xionghuluo@tencent.com> <76035a5e-f0d8-8bc5-93e9-cfb08b2127f8@gmail.com> <20220810170700.GA25951@gate.crashing.org> <472c1531-aae6-123e-6b0c-8827f5585879@gmail.com>
Hi Xionghu,

Thanks for the updated version of the patch; some comments are inlined below.

on 2022/8/11 14:15, Xionghu Luo wrote:
>
>
> On 2022/8/11 01:07, Segher Boessenkool wrote:
>> On Wed, Aug 10, 2022 at 02:39:02PM +0800, Xionghu Luo wrote:
>>> On 2022/8/9 11:01, Kewen.Lin wrote:
>>>> I have some concern on those changed "altivec_*_direct", IMHO the suffix
>>>> "_direct" is normally to indicate the define_insn is mapped to the
>>>> corresponding hw insn directly.  With this change, for example,
>>>> altivec_vmrghb_direct can be mapped into vmrghb or vmrglb, this looks
>>>> misleading.  Maybe we can add the corresponding _direct_le and _direct_be
>>>> versions, both are mapped into the same insn but have different RTL
>>>> patterns.  Looking forward to Segher's and David's suggestions.
>>>
>>> Thanks!  Do you mean same RTL patterns with different hw insn?
>>
>> A pattern called altivec_vmrghb_direct_le should always emit a vmrghb
>> instruction, never a vmrglb instead.  Misleading names are an expensive
>> problem.
>>
>>
>
> Thanks.  Then on LE platforms, if user calls altivec_vmrghw, it will be
> expanded to RTL (vec_select (vec_concat (R0 R1 (0 4 1 5))), and
> finally matched to altivec_vmrglw_direct_v4si_le with ASM "vmrglw".
> For BE just strict forward, seems more clear :-), OK for master?
>
>
> [PATCH v3] rs6000: Fix incorrect RTL for Power LE when removing the UNSPECS [PR106069]
>
> v3: rename altivec_vmrghb_direct_le to altivec_vmrglb_direct_le to match
> the actual output ASM vmrglb. Likewise for all similar xxx_direct_le
> patterns.
> v2: Split the direct pattern to be and le with same RTL but different insn.
>
> The native RTL expression for vec_mrghw should be same for BE and LE as
> they are register and endian-independent.  So both BE and LE need
> generate exactly same RTL with index [0 4 1 5] when expanding vec_mrghw
> with vec_select and vec_concat.
>
> (set (reg:V4SI 141) (vec_select:V4SI (vec_concat:V8SI
>            (subreg:V4SI (reg:V16QI 139) 0)
>            (subreg:V4SI (reg:V16QI 140) 0))
>            [const_int 0 4 1 5]))
>
> Then combine pass could do the nested vec_select optimization
> in simplify-rtx.c:simplify_binary_operation_1 also on both BE and LE:
>
> 21: r150:V4SI=vec_select(vec_concat(r141:V4SI,r146:V4SI),parallel [0 4 1 5])
> 24: {r151:SI=vec_select(r150:V4SI,parallel [const_int 3]);}
>
> =>
>
> 21: r150:V4SI=vec_select(vec_concat(r141:V4SI,r146:V4SI),parallel)
> 24: {r151:SI=vec_select(r146:V4SI,parallel [const_int 1]);}
>
> The endianness check need only once at ASM generation finally.
> ASM would be better due to nested vec_select simplified to simple scalar
> load.
>
> Regression tested pass for Power8{LE,BE}{32,64} and Power{9,10}LE{64}
> Linux(Thanks to Kewen).
>
> gcc/ChangeLog:
>
>     PR target/106069
>     * config/rs6000/altivec.md (altivec_vmrghb_direct): Remove.
>     (altivec_vmrghb_direct_be): New pattern for BE.
>     (altivec_vmrglb_direct_le): New pattern for LE.
>     (altivec_vmrghh_direct): Remove.
>     (altivec_vmrghh_direct_be): New pattern for BE.
>     (altivec_vmrglh_direct_le): New pattern for LE.
>     (altivec_vmrghw_direct_): Remove.
>     (altivec_vmrghw_direct__be): New pattern for BE.
>     (altivec_vmrglw_direct__le): New pattern for LE.
>     (altivec_vmrglb_direct): Remove.
>     (altivec_vmrglb_direct_be): New pattern for BE.
>     (altivec_vmrghb_direct_le): New pattern for LE.
>     (altivec_vmrglh_direct): Remove.
>     (altivec_vmrglh_direct_be): New pattern for BE.
>     (altivec_vmrghh_direct_le): New pattern for LE.
>     (altivec_vmrglw_direct_): Remove.
>     (altivec_vmrglw_direct__be): New pattern for BE.
>     (altivec_vmrghw_direct__le): New pattern for LE.
>     * config/rs6000/rs6000.cc (altivec_expand_vec_perm_const):
>     Adjust.
>     * config/rs6000/vsx.md: Likewise.
>
> gcc/testsuite/ChangeLog:
>
>     PR target/106069
>     * g++.target/powerpc/pr106069.C: New test.
>
> Signed-off-by: Xionghu Luo
> ---
>  gcc/config/rs6000/altivec.md                | 223 ++++++++++++++------
>  gcc/config/rs6000/rs6000.cc                 |  36 ++--
>  gcc/config/rs6000/vsx.md                    |  24 +--
>  gcc/testsuite/g++.target/powerpc/pr106069.C | 120 +++++++++++
>  4 files changed, 305 insertions(+), 98 deletions(-)
>  create mode 100644 gcc/testsuite/g++.target/powerpc/pr106069.C
>
> diff --git a/gcc/config/rs6000/altivec.md b/gcc/config/rs6000/altivec.md
> index 2c4940f2e21..78245f470e9 100644
> --- a/gcc/config/rs6000/altivec.md
> +++ b/gcc/config/rs6000/altivec.md
> @@ -1144,15 +1144,17 @@ (define_expand "altivec_vmrghb"
>     (use (match_operand:V16QI 2 "register_operand"))]
>    "TARGET_ALTIVEC"
>  {
> -  rtx (*fun) (rtx, rtx, rtx) = BYTES_BIG_ENDIAN ? gen_altivec_vmrghb_direct
> -                        : gen_altivec_vmrglb_direct;
> -  if (!BYTES_BIG_ENDIAN)
> -    std::swap (operands[1], operands[2]);
> -  emit_insn (fun (operands[0], operands[1], operands[2]));
> +  rtvec v = gen_rtvec (16, GEN_INT (0), GEN_INT (16), GEN_INT (1), GEN_INT (17),
> +              GEN_INT (2), GEN_INT (18), GEN_INT (3), GEN_INT (19),
> +              GEN_INT (4), GEN_INT (20), GEN_INT (5), GEN_INT (21),
> +              GEN_INT (6), GEN_INT (22), GEN_INT (7), GEN_INT (23));
> +  rtx x = gen_rtx_VEC_CONCAT (V32QImode, operands[1], operands[2]);
> +  x = gen_rtx_VEC_SELECT (V16QImode, x, gen_rtx_PARALLEL (VOIDmode, v));
> +  emit_insn (gen_rtx_SET (operands[0], x));
>    DONE;
>  })

I think you can just call gen_altivec_vmrghb_direct_be and
gen_altivec_vmrglb_direct_le separately here (a rough sketch follows
below).  Similar for some other define_expands.

>
> -(define_insn "altivec_vmrghb_direct"
> +(define_insn "altivec_vmrghb_direct_be"
>    [(set (match_operand:V16QI 0 "register_operand" "=v")
>      (vec_select:V16QI
>        (vec_concat:V32QI
> @@ -1166,27 +1168,46 @@ (define_insn "altivec_vmrghb_direct"
>               (const_int 5) (const_int 21)
>               (const_int 6) (const_int 22)
>               (const_int 7) (const_int 23)])))]
> -  "TARGET_ALTIVEC"
> +  "TARGET_ALTIVEC && BYTES_BIG_ENDIAN"
>    "vmrghb %0,%1,%2"
>    [(set_attr "type" "vecperm")])
>

Could you move the following altivec_vmrglb_direct_le here?  Then
readers can easily check the difference between be and le for the same
altivec_vmrghb_direct.
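As an aside, here is the rough, untested sketch mentioned above for the
first point, using the _be/_le pattern names from this v3 patch; it
mirrors what the patch already does for vsx_xxmrghw_ in vsx.md and
avoids building the rtvec by hand:

  (define_expand "altivec_vmrghb"
    [(use (match_operand:V16QI 0 "register_operand"))
     (use (match_operand:V16QI 1 "register_operand"))
     (use (match_operand:V16QI 2 "register_operand"))]
    "TARGET_ALTIVEC"
  {
    /* Pick the named pattern that matches the [0 16 1 17 ...] RTL.  */
    if (BYTES_BIG_ENDIAN)
      emit_insn (gen_altivec_vmrghb_direct_be (operands[0], operands[1],
                                               operands[2]));
    else
      emit_insn (gen_altivec_vmrglb_direct_le (operands[0], operands[1],
                                               operands[2]));
    DONE;
  })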
Same comment applied for some other similar cases. > +(define_insn "altivec_vmrglb_direct_le" > +  [(set (match_operand:V16QI 0 "register_operand" "=v") > +    (vec_select:V16QI > +      (vec_concat:V32QI > +        (match_operand:V16QI 1 "register_operand" "v") > +        (match_operand:V16QI 2 "register_operand" "v")) > +      (parallel [(const_int 0) (const_int 16) > +             (const_int 1) (const_int 17) > +             (const_int 2) (const_int 18) > +             (const_int 3) (const_int 19) > +             (const_int 4) (const_int 20) > +             (const_int 5) (const_int 21) > +             (const_int 6) (const_int 22) > +             (const_int 7) (const_int 23)])))] > +  "TARGET_ALTIVEC && !BYTES_BIG_ENDIAN" > +  "vmrglb %0,%2,%1" > +  [(set_attr "type" "vecperm")]) Could you update this pattern for assembly "vmrglb %0,%1,%2" instead of "vmrglb %0,%2,%1"? I checked the previous md before the culprit commit 0910c516a3d72af048, it emits "vmrglb %0,%1,%2" for altivec_vmrglb_direct. Same comment applied for some other similar cases. > + >  (define_expand "altivec_vmrghh" >    [(use (match_operand:V8HI 0 "register_operand")) >     (use (match_operand:V8HI 1 "register_operand")) >     (use (match_operand:V8HI 2 "register_operand"))] >    "TARGET_ALTIVEC" >  { > -  rtx (*fun) (rtx, rtx, rtx) = BYTES_BIG_ENDIAN ? gen_altivec_vmrghh_direct > -                        : gen_altivec_vmrglh_direct; > -  if (!BYTES_BIG_ENDIAN) > -    std::swap (operands[1], operands[2]); > -  emit_insn (fun (operands[0], operands[1], operands[2])); > +  rtvec v = gen_rtvec (8, GEN_INT (0), GEN_INT (8), GEN_INT (1), GEN_INT (9), > +              GEN_INT (2), GEN_INT (10), GEN_INT (3), GEN_INT (11)); > +  rtx x = gen_rtx_VEC_CONCAT (V16HImode, operands[1], operands[2]); > + > +  x = gen_rtx_VEC_SELECT (V8HImode, x, gen_rtx_PARALLEL (VOIDmode, v)); > +  emit_insn (gen_rtx_SET (operands[0], x)); >    DONE; >  }) >   > -(define_insn "altivec_vmrghh_direct" > +(define_insn "altivec_vmrghh_direct_be" >    [(set (match_operand:V8HI 0 "register_operand" "=v") > -        (vec_select:V8HI > +    (vec_select:V8HI >        (vec_concat:V16HI >          (match_operand:V8HI 1 "register_operand" "v") >          (match_operand:V8HI 2 "register_operand" "v")) > @@ -1194,26 +1215,38 @@ (define_insn "altivec_vmrghh_direct" >               (const_int 1) (const_int 9) >               (const_int 2) (const_int 10) >               (const_int 3) (const_int 11)])))] > -  "TARGET_ALTIVEC" > +  "TARGET_ALTIVEC && BYTES_BIG_ENDIAN" >    "vmrghh %0,%1,%2" >    [(set_attr "type" "vecperm")]) >   > +(define_insn "altivec_vmrglh_direct_le" > +  [(set (match_operand:V8HI 0 "register_operand" "=v") > +    (vec_select:V8HI > +      (vec_concat:V16HI > +        (match_operand:V8HI 1 "register_operand" "v") > +        (match_operand:V8HI 2 "register_operand" "v")) > +      (parallel [(const_int 0) (const_int 8) > +             (const_int 1) (const_int 9) > +             (const_int 2) (const_int 10) > +             (const_int 3) (const_int 11)])))] > +  "TARGET_ALTIVEC && !BYTES_BIG_ENDIAN" > +  "vmrglh %0,%2,%1" > +  [(set_attr "type" "vecperm")]) > + >  (define_expand "altivec_vmrghw" >    [(use (match_operand:V4SI 0 "register_operand")) >     (use (match_operand:V4SI 1 "register_operand")) >     (use (match_operand:V4SI 2 "register_operand"))] >    "VECTOR_MEM_ALTIVEC_P (V4SImode)" >  { > -  rtx (*fun) (rtx, rtx, rtx); > -  fun = BYTES_BIG_ENDIAN ? 
gen_altivec_vmrghw_direct_v4si > -             : gen_altivec_vmrglw_direct_v4si; > -  if (!BYTES_BIG_ENDIAN) > -    std::swap (operands[1], operands[2]); > -  emit_insn (fun (operands[0], operands[1], operands[2])); > +  rtvec v = gen_rtvec (4, GEN_INT (0), GEN_INT (4), GEN_INT (1), GEN_INT (5)); > +  rtx x = gen_rtx_VEC_CONCAT (V8SImode, operands[1], operands[2]); > +  x = gen_rtx_VEC_SELECT (V4SImode, x, gen_rtx_PARALLEL (VOIDmode, v)); > +  emit_insn (gen_rtx_SET (operands[0], x)); >    DONE; >  }) >   > -(define_insn "altivec_vmrghw_direct_" > +(define_insn "altivec_vmrghw_direct__be" >    [(set (match_operand:VSX_W 0 "register_operand" "=wa,v") >      (vec_select:VSX_W >        (vec_concat: > @@ -1221,10 +1254,24 @@ (define_insn "altivec_vmrghw_direct_" >          (match_operand:VSX_W 2 "register_operand" "wa,v")) >        (parallel [(const_int 0) (const_int 4) >               (const_int 1) (const_int 5)])))] > -  "TARGET_ALTIVEC" > +  "TARGET_ALTIVEC && BYTES_BIG_ENDIAN" > +  "@ > +  xxmrghw %x0,%x1,%x2 > +  vmrghw %0,%1,%2" > +  [(set_attr "type" "vecperm")]) > + > +(define_insn "altivec_vmrglw_direct__le" > +  [(set (match_operand:VSX_W 0 "register_operand" "=wa,v") > +    (vec_select:VSX_W > +      (vec_concat: > +        (match_operand:VSX_W 1 "register_operand" "wa,v") > +        (match_operand:VSX_W 2 "register_operand" "wa,v")) > +      (parallel [(const_int 0) (const_int 4) > +             (const_int 1) (const_int 5)])))] > +  "TARGET_ALTIVEC && !BYTES_BIG_ENDIAN" >    "@ > -   xxmrghw %x0,%x1,%x2 > -   vmrghw %0,%1,%2" > +  xxmrglw %x0,%x2,%x1 > +  vmrglw %0,%2,%1" >    [(set_attr "type" "vecperm")]) >   >  (define_insn "*altivec_vmrghsf" > @@ -1250,15 +1297,17 @@ (define_expand "altivec_vmrglb" >     (use (match_operand:V16QI 2 "register_operand"))] >    "TARGET_ALTIVEC" >  { > -  rtx (*fun) (rtx, rtx, rtx) = BYTES_BIG_ENDIAN ? 
gen_altivec_vmrglb_direct > -                        : gen_altivec_vmrghb_direct; > -  if (!BYTES_BIG_ENDIAN) > -    std::swap (operands[1], operands[2]); > -  emit_insn (fun (operands[0], operands[1], operands[2])); > +  rtvec v = gen_rtvec (16, GEN_INT (8), GEN_INT (24), GEN_INT (9), GEN_INT (25), > +              GEN_INT (10), GEN_INT (26), GEN_INT (11), GEN_INT (27), > +              GEN_INT (12), GEN_INT (28), GEN_INT (13), GEN_INT (29), > +              GEN_INT (14), GEN_INT (30), GEN_INT (15), GEN_INT (31)); > +  rtx x = gen_rtx_VEC_CONCAT (V32QImode, operands[1], operands[2]); > +  x = gen_rtx_VEC_SELECT (V16QImode, x, gen_rtx_PARALLEL (VOIDmode, v)); > +  emit_insn (gen_rtx_SET (operands[0], x)); >    DONE; >  }) >   > -(define_insn "altivec_vmrglb_direct" > +(define_insn "altivec_vmrglb_direct_be" >    [(set (match_operand:V16QI 0 "register_operand" "=v") >      (vec_select:V16QI >        (vec_concat:V32QI > @@ -1272,25 +1321,43 @@ (define_insn "altivec_vmrglb_direct" >               (const_int 13) (const_int 29) >               (const_int 14) (const_int 30) >               (const_int 15) (const_int 31)])))] > -  "TARGET_ALTIVEC" > +  "TARGET_ALTIVEC && BYTES_BIG_ENDIAN" >    "vmrglb %0,%1,%2" >    [(set_attr "type" "vecperm")]) >   > +(define_insn "altivec_vmrghb_direct_le" > +  [(set (match_operand:V16QI 0 "register_operand" "=v") > +    (vec_select:V16QI > +      (vec_concat:V32QI > +        (match_operand:V16QI 1 "register_operand" "v") > +        (match_operand:V16QI 2 "register_operand" "v")) > +      (parallel [(const_int  8) (const_int 24) > +             (const_int  9) (const_int 25) > +             (const_int 10) (const_int 26) > +             (const_int 11) (const_int 27) > +             (const_int 12) (const_int 28) > +             (const_int 13) (const_int 29) > +             (const_int 14) (const_int 30) > +             (const_int 15) (const_int 31)])))] > +  "TARGET_ALTIVEC && !BYTES_BIG_ENDIAN" > +  "vmrghb %0,%2,%1" > +  [(set_attr "type" "vecperm")]) > + >  (define_expand "altivec_vmrglh" >    [(use (match_operand:V8HI 0 "register_operand")) >     (use (match_operand:V8HI 1 "register_operand")) >     (use (match_operand:V8HI 2 "register_operand"))] >    "TARGET_ALTIVEC" >  { > -  rtx (*fun) (rtx, rtx, rtx) = BYTES_BIG_ENDIAN ? 
gen_altivec_vmrglh_direct > -                        : gen_altivec_vmrghh_direct; > -  if (!BYTES_BIG_ENDIAN) > -    std::swap (operands[1], operands[2]); > -  emit_insn (fun (operands[0], operands[1], operands[2])); > +  rtvec v = gen_rtvec (8, GEN_INT (4), GEN_INT (12), GEN_INT (5), GEN_INT (13), > +              GEN_INT (6), GEN_INT (14), GEN_INT (7), GEN_INT (15)); > +  rtx x = gen_rtx_VEC_CONCAT (V16HImode, operands[1], operands[2]); > +  x = gen_rtx_VEC_SELECT (V8HImode, x, gen_rtx_PARALLEL (VOIDmode, v)); > +  emit_insn (gen_rtx_SET (operands[0], x)); >    DONE; >  }) >   > -(define_insn "altivec_vmrglh_direct" > +(define_insn "altivec_vmrglh_direct_be" >    [(set (match_operand:V8HI 0 "register_operand" "=v") >          (vec_select:V8HI >        (vec_concat:V16HI > @@ -1300,26 +1367,38 @@ (define_insn "altivec_vmrglh_direct" >               (const_int 5) (const_int 13) >               (const_int 6) (const_int 14) >               (const_int 7) (const_int 15)])))] > -  "TARGET_ALTIVEC" > +  "TARGET_ALTIVEC && BYTES_BIG_ENDIAN" >    "vmrglh %0,%1,%2" >    [(set_attr "type" "vecperm")]) >   > +(define_insn "altivec_vmrghh_direct_le" > +  [(set (match_operand:V8HI 0 "register_operand" "=v") > +        (vec_select:V8HI > +      (vec_concat:V16HI > +        (match_operand:V8HI 1 "register_operand" "v") > +        (match_operand:V8HI 2 "register_operand" "v")) > +      (parallel [(const_int 4) (const_int 12) > +             (const_int 5) (const_int 13) > +             (const_int 6) (const_int 14) > +             (const_int 7) (const_int 15)])))] > +  "TARGET_ALTIVEC && !BYTES_BIG_ENDIAN" > +  "vmrghh %0,%2,%1" > +  [(set_attr "type" "vecperm")]) > + >  (define_expand "altivec_vmrglw" >    [(use (match_operand:V4SI 0 "register_operand")) >     (use (match_operand:V4SI 1 "register_operand")) >     (use (match_operand:V4SI 2 "register_operand"))] >    "VECTOR_MEM_ALTIVEC_P (V4SImode)" >  { > -  rtx (*fun) (rtx, rtx, rtx); > -  fun = BYTES_BIG_ENDIAN ? 
gen_altivec_vmrglw_direct_v4si > -             : gen_altivec_vmrghw_direct_v4si; > -  if (!BYTES_BIG_ENDIAN) > -    std::swap (operands[1], operands[2]); > -  emit_insn (fun (operands[0], operands[1], operands[2])); > +  rtvec v = gen_rtvec (4, GEN_INT (2), GEN_INT (6), GEN_INT (3), GEN_INT (7)); > +  rtx x = gen_rtx_VEC_CONCAT (V8SImode, operands[1], operands[2]); > +  x = gen_rtx_VEC_SELECT (V4SImode, x, gen_rtx_PARALLEL (VOIDmode, v)); > +  emit_insn (gen_rtx_SET (operands[0], x)); >    DONE; >  }) >   > -(define_insn "altivec_vmrglw_direct_" > +(define_insn "altivec_vmrglw_direct__be" >    [(set (match_operand:VSX_W 0 "register_operand" "=wa,v") >      (vec_select:VSX_W >        (vec_concat: > @@ -1327,10 +1406,24 @@ (define_insn "altivec_vmrglw_direct_" >          (match_operand:VSX_W 2 "register_operand" "wa,v")) >        (parallel [(const_int 2) (const_int 6) >               (const_int 3) (const_int 7)])))] > -  "TARGET_ALTIVEC" > +  "TARGET_ALTIVEC && BYTES_BIG_ENDIAN" > +  "@ > +  xxmrglw %x0,%x1,%x2 > +  vmrglw %0,%1,%2" > +  [(set_attr "type" "vecperm")]) > + > +(define_insn "altivec_vmrghw_direct__le" > +  [(set (match_operand:VSX_W 0 "register_operand" "=wa,v") > +    (vec_select:VSX_W > +      (vec_concat: > +        (match_operand:VSX_W 1 "register_operand" "wa,v") > +        (match_operand:VSX_W 2 "register_operand" "wa,v")) > +      (parallel [(const_int 2) (const_int 6) > +             (const_int 3) (const_int 7)])))] > +  "TARGET_ALTIVEC && !BYTES_BIG_ENDIAN" >    "@ > -   xxmrglw %x0,%x1,%x2 > -   vmrglw %0,%1,%2" > +  xxmrghw %x0,%x2,%x1 > +  vmrghw %0,%2,%1" >    [(set_attr "type" "vecperm")]) >   >  (define_insn "*altivec_vmrglsf" > @@ -3699,13 +3792,13 @@ (define_expand "vec_widen_umult_hi_v16qi" >      { >        emit_insn (gen_altivec_vmuleub (ve, operands[1], operands[2])); >        emit_insn (gen_altivec_vmuloub (vo, operands[1], operands[2])); > -      emit_insn (gen_altivec_vmrghh_direct (operands[0], ve, vo)); > +      emit_insn (gen_altivec_vmrghh_direct_be (operands[0], ve, vo)); >      } >    else >      { >        emit_insn (gen_altivec_vmuloub (ve, operands[1], operands[2])); >        emit_insn (gen_altivec_vmuleub (vo, operands[1], operands[2])); > -      emit_insn (gen_altivec_vmrghh_direct (operands[0], vo, ve)); > +      emit_insn (gen_altivec_vmrghh_direct_le (operands[0], ve, vo)); Note that if you change assembly "vmrghh %0,%2,%1" to "vmrghh %0,%1,%2", you need to change this to: emit_insn (gen_altivec_vmrghh_direct_le (operands[0], vo, ve)); Same comment applied for some other similar cases. 
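To illustrate (an untested sketch, assuming the output template is
changed to "vmrghh %0,%1,%2" as suggested), the little-endian arm of
vec_widen_umult_hi_v16qi would then read:

      emit_insn (gen_altivec_vmuloub (ve, operands[1], operands[2]));
      emit_insn (gen_altivec_vmuleub (vo, operands[1], operands[2]));
      /* Operands swapped at the call site to compensate for the
         unswapped operand order in the LE insn's output template.  */
      emit_insn (gen_altivec_vmrghh_direct_le (operands[0], vo, ve));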
>      } >    DONE; >  }) > @@ -3724,13 +3817,13 @@ (define_expand "vec_widen_umult_lo_v16qi" >      { >        emit_insn (gen_altivec_vmuleub (ve, operands[1], operands[2])); >        emit_insn (gen_altivec_vmuloub (vo, operands[1], operands[2])); > -      emit_insn (gen_altivec_vmrglh_direct (operands[0], ve, vo)); > +      emit_insn (gen_altivec_vmrglh_direct_be (operands[0], ve, vo)); >      } >    else >      { >        emit_insn (gen_altivec_vmuloub (ve, operands[1], operands[2])); >        emit_insn (gen_altivec_vmuleub (vo, operands[1], operands[2])); > -      emit_insn (gen_altivec_vmrglh_direct (operands[0], vo, ve)); > +      emit_insn (gen_altivec_vmrglh_direct_le (operands[0], ve, vo)); >      } >    DONE; >  }) > @@ -3749,13 +3842,13 @@ (define_expand "vec_widen_smult_hi_v16qi" >      { >        emit_insn (gen_altivec_vmulesb (ve, operands[1], operands[2])); >        emit_insn (gen_altivec_vmulosb (vo, operands[1], operands[2])); > -      emit_insn (gen_altivec_vmrghh_direct (operands[0], ve, vo)); > +      emit_insn (gen_altivec_vmrghh_direct_be (operands[0], ve, vo)); >      } >    else >      { >        emit_insn (gen_altivec_vmulosb (ve, operands[1], operands[2])); >        emit_insn (gen_altivec_vmulesb (vo, operands[1], operands[2])); > -      emit_insn (gen_altivec_vmrghh_direct (operands[0], vo, ve)); > +      emit_insn (gen_altivec_vmrghh_direct_le (operands[0], ve, vo)); >      } >    DONE; >  }) > @@ -3774,13 +3867,13 @@ (define_expand "vec_widen_smult_lo_v16qi" >      { >        emit_insn (gen_altivec_vmulesb (ve, operands[1], operands[2])); >        emit_insn (gen_altivec_vmulosb (vo, operands[1], operands[2])); > -      emit_insn (gen_altivec_vmrglh_direct (operands[0], ve, vo)); > +      emit_insn (gen_altivec_vmrglh_direct_be (operands[0], ve, vo)); >      } >    else >      { >        emit_insn (gen_altivec_vmulosb (ve, operands[1], operands[2])); >        emit_insn (gen_altivec_vmulesb (vo, operands[1], operands[2])); > -      emit_insn (gen_altivec_vmrglh_direct (operands[0], vo, ve)); > +      emit_insn (gen_altivec_vmrglh_direct_le (operands[0], ve, vo)); >      } >    DONE; >  }) > @@ -3799,13 +3892,13 @@ (define_expand "vec_widen_umult_hi_v8hi" >      { >        emit_insn (gen_altivec_vmuleuh (ve, operands[1], operands[2])); >        emit_insn (gen_altivec_vmulouh (vo, operands[1], operands[2])); > -      emit_insn (gen_altivec_vmrghw_direct_v4si (operands[0], ve, vo)); > +      emit_insn (gen_altivec_vmrghw_direct_v4si_be (operands[0], ve, vo)); >      } >    else >      { >        emit_insn (gen_altivec_vmulouh (ve, operands[1], operands[2])); >        emit_insn (gen_altivec_vmuleuh (vo, operands[1], operands[2])); > -      emit_insn (gen_altivec_vmrghw_direct_v4si (operands[0], vo, ve)); > +      emit_insn (gen_altivec_vmrghw_direct_v4si_le (operands[0], ve, vo)); >      } >    DONE; >  }) > @@ -3824,13 +3917,13 @@ (define_expand "vec_widen_umult_lo_v8hi" >      { >        emit_insn (gen_altivec_vmuleuh (ve, operands[1], operands[2])); >        emit_insn (gen_altivec_vmulouh (vo, operands[1], operands[2])); > -      emit_insn (gen_altivec_vmrglw_direct_v4si (operands[0], ve, vo)); > +      emit_insn (gen_altivec_vmrglw_direct_v4si_be (operands[0], ve, vo)); >      } >    else >      { >        emit_insn (gen_altivec_vmulouh (ve, operands[1], operands[2])); >        emit_insn (gen_altivec_vmuleuh (vo, operands[1], operands[2])); > -      emit_insn (gen_altivec_vmrglw_direct_v4si (operands[0], vo, ve)); > +      emit_insn 
(gen_altivec_vmrglw_direct_v4si_le (operands[0], ve, vo)); >      } >    DONE; >  }) > @@ -3849,13 +3942,13 @@ (define_expand "vec_widen_smult_hi_v8hi" >      { >        emit_insn (gen_altivec_vmulesh (ve, operands[1], operands[2])); >        emit_insn (gen_altivec_vmulosh (vo, operands[1], operands[2])); > -      emit_insn (gen_altivec_vmrghw_direct_v4si (operands[0], ve, vo)); > +      emit_insn (gen_altivec_vmrghw_direct_v4si_be (operands[0], ve, vo)); >      } >    else >      { >        emit_insn (gen_altivec_vmulosh (ve, operands[1], operands[2])); >        emit_insn (gen_altivec_vmulesh (vo, operands[1], operands[2])); > -      emit_insn (gen_altivec_vmrghw_direct_v4si (operands[0], vo, ve)); > +      emit_insn (gen_altivec_vmrghw_direct_v4si_le (operands[0], ve, vo)); >      } >    DONE; >  }) > @@ -3874,13 +3967,13 @@ (define_expand "vec_widen_smult_lo_v8hi" >      { >        emit_insn (gen_altivec_vmulesh (ve, operands[1], operands[2])); >        emit_insn (gen_altivec_vmulosh (vo, operands[1], operands[2])); > -      emit_insn (gen_altivec_vmrglw_direct_v4si (operands[0], ve, vo)); > +      emit_insn (gen_altivec_vmrglw_direct_v4si_be (operands[0], ve, vo)); >      } >    else >      { >        emit_insn (gen_altivec_vmulosh (ve, operands[1], operands[2])); >        emit_insn (gen_altivec_vmulesh (vo, operands[1], operands[2])); > -      emit_insn (gen_altivec_vmrglw_direct_v4si (operands[0], vo, ve)); > +      emit_insn (gen_altivec_vmrglw_direct_v4si_le (operands[0], ve, vo)); >      } >    DONE; >  }) > diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc > index df491bee2ea..97da7706f63 100644 > --- a/gcc/config/rs6000/rs6000.cc > +++ b/gcc/config/rs6000/rs6000.cc > @@ -22941,29 +22941,17 @@ altivec_expand_vec_perm_const (rtx target, rtx op0, rtx op1, >      {OPTION_MASK_ALTIVEC, >       CODE_FOR_altivec_vpkuwum_direct, >       {2, 3, 6, 7, 10, 11, 14, 15, 18, 19, 22, 23, 26, 27, 30, 31}}, > -    {OPTION_MASK_ALTIVEC, > -     BYTES_BIG_ENDIAN ? CODE_FOR_altivec_vmrghb_direct > -              : CODE_FOR_altivec_vmrglb_direct, > +    {OPTION_MASK_ALTIVEC, CODE_FOR_altivec_vmrghb_direct_be, >       {0, 16, 1, 17, 2, 18, 3, 19, 4, 20, 5, 21, 6, 22, 7, 23}}, Before the culprit commit 0910c516a3d72af04, we have: { OPTION_MASK_ALTIVEC, (BYTES_BIG_ENDIAN ? CODE_FOR_altivec_vmrghb_direct : CODE_FOR_altivec_vmrglb_direct), { 0, 16, 1, 17, 2, 18, 3, 19, 4, 20, 5, 21, 6, 22, 7, 23 } }, I think we should use: { OPTION_MASK_ALTIVEC, (BYTES_BIG_ENDIAN ? CODE_FOR_altivec_vmrghb_direct_be : CODE_FOR_altivec_vmrglb_direct_le), { 0, 16, 1, 17, 2, 18, 3, 19, 4, 20, 5, 21, 6, 22, 7, 23 } }, here instead. Similar comment for those related below. > -    {OPTION_MASK_ALTIVEC, > -     BYTES_BIG_ENDIAN ? CODE_FOR_altivec_vmrghh_direct > -              : CODE_FOR_altivec_vmrglh_direct, > +    {OPTION_MASK_ALTIVEC, CODE_FOR_altivec_vmrghh_direct_be, >       {0, 1, 16, 17, 2, 3, 18, 19, 4, 5, 20, 21, 6, 7, 22, 23}}, > -    {OPTION_MASK_ALTIVEC, > -     BYTES_BIG_ENDIAN ? CODE_FOR_altivec_vmrghw_direct_v4si > -              : CODE_FOR_altivec_vmrglw_direct_v4si, > +    {OPTION_MASK_ALTIVEC, CODE_FOR_altivec_vmrghw_direct_v4si_be, >       {0, 1, 2, 3, 16, 17, 18, 19, 4, 5, 6, 7, 20, 21, 22, 23}}, > -    {OPTION_MASK_ALTIVEC, > -     BYTES_BIG_ENDIAN ? 
CODE_FOR_altivec_vmrglb_direct > -              : CODE_FOR_altivec_vmrghb_direct, > +    {OPTION_MASK_ALTIVEC, CODE_FOR_altivec_vmrglb_direct_be, >       {8, 24, 9, 25, 10, 26, 11, 27, 12, 28, 13, 29, 14, 30, 15, 31}}, > -    {OPTION_MASK_ALTIVEC, > -     BYTES_BIG_ENDIAN ? CODE_FOR_altivec_vmrglh_direct > -              : CODE_FOR_altivec_vmrghh_direct, > +    {OPTION_MASK_ALTIVEC, CODE_FOR_altivec_vmrglh_direct_be, >       {8, 9, 24, 25, 10, 11, 26, 27, 12, 13, 28, 29, 14, 15, 30, 31}}, > -    {OPTION_MASK_ALTIVEC, > -     BYTES_BIG_ENDIAN ? CODE_FOR_altivec_vmrglw_direct_v4si > -              : CODE_FOR_altivec_vmrghw_direct_v4si, > +    {OPTION_MASK_ALTIVEC, CODE_FOR_altivec_vmrglw_direct_v4si_be, >       {8, 9, 10, 11, 24, 25, 26, 27, 12, 13, 14, 15, 28, 29, 30, 31}}, >      {OPTION_MASK_P8_VECTOR, >       BYTES_BIG_ENDIAN ? CODE_FOR_p8_vmrgew_v4sf_direct > @@ -23146,9 +23134,15 @@ altivec_expand_vec_perm_const (rtx target, rtx op0, rtx op1, >   >            /* For little-endian, the two input operands must be swapped >               (or swapped back) to ensure proper right-to-left numbering > -             from 0 to 2N-1.  */ > -      if (swapped ^ !BYTES_BIG_ENDIAN > -          && icode != CODE_FOR_vsx_xxpermdi_v16qi) > +         from 0 to 2N-1.  Excludes the vmrg[lh][bhw] and xxpermdi ops.  */ > +      if (swapped ^ !BYTES_BIG_ENDIAN) > +        if (!(icode == CODE_FOR_altivec_vmrghb_direct_be > +          || icode == CODE_FOR_altivec_vmrglb_direct_be > +          || icode == CODE_FOR_altivec_vmrghh_direct_be > +          || icode == CODE_FOR_altivec_vmrglh_direct_be > +          || icode == CODE_FOR_altivec_vmrghw_direct_v4si_be > +          || icode == CODE_FOR_altivec_vmrglw_direct_v4si_be > +          || icode == CODE_FOR_vsx_xxpermdi_v16qi)) >          std::swap (op0, op1); IIUC, we don't need this part of change once we fix the operand order in the assembly for those LE "direct"s. BR, Kewen >        if (imode != V16QImode) >          { > diff --git a/gcc/config/rs6000/vsx.md b/gcc/config/rs6000/vsx.md > index e226a93bbe5..2ae1bce131d 100644 > --- a/gcc/config/rs6000/vsx.md > +++ b/gcc/config/rs6000/vsx.md > @@ -4688,12 +4688,12 @@ (define_expand "vsx_xxmrghw_" >               (const_int 1) (const_int 5)])))] >    "VECTOR_MEM_VSX_P (mode)" >  { > -  rtx (*fun) (rtx, rtx, rtx); > -  fun = BYTES_BIG_ENDIAN ? gen_altivec_vmrghw_direct_ > -             : gen_altivec_vmrglw_direct_; > -  if (!BYTES_BIG_ENDIAN) > -    std::swap (operands[1], operands[2]); > -  emit_insn (fun (operands[0], operands[1], operands[2])); > +  if (BYTES_BIG_ENDIAN) > +    emit_insn ( > +    gen_altivec_vmrghw_direct_v4si_be (operands[0], operands[1], operands[2])); > +  else > +    emit_insn ( > +    gen_altivec_vmrglw_direct_v4si_le (operands[0], operands[1], operands[2])); >    DONE; >  } >    [(set_attr "type" "vecperm")]) > @@ -4708,12 +4708,12 @@ (define_expand "vsx_xxmrglw_" >               (const_int 3) (const_int 7)])))] >    "VECTOR_MEM_VSX_P (mode)" >  { > -  rtx (*fun) (rtx, rtx, rtx); > -  fun = BYTES_BIG_ENDIAN ? 
gen_altivec_vmrglw_direct_ > -             : gen_altivec_vmrghw_direct_; > -  if (!BYTES_BIG_ENDIAN) > -    std::swap (operands[1], operands[2]); > -  emit_insn (fun (operands[0], operands[1], operands[2])); > +  if (BYTES_BIG_ENDIAN) > +    emit_insn ( > +    gen_altivec_vmrglw_direct_v4si_be (operands[0], operands[1], operands[2])); > +  else > +    emit_insn ( > +    gen_altivec_vmrghw_direct_v4si_le (operands[0], operands[1], operands[2])); >    DONE; >  } >    [(set_attr "type" "vecperm")]) > diff --git a/gcc/testsuite/g++.target/powerpc/pr106069.C b/gcc/testsuite/g++.target/powerpc/pr106069.C > new file mode 100644 > index 00000000000..2cde9b821e3 > --- /dev/null > +++ b/gcc/testsuite/g++.target/powerpc/pr106069.C > @@ -0,0 +1,120 @@ > +/* { dg-options "-O -fno-tree-forwprop -maltivec" } */ > +/* { dg-require-effective-target vmx_hw } */ > +/* { dg-do run } */ > + > +extern "C" void * > +memcpy (void *, const void *, unsigned long); > +typedef __attribute__ ((altivec (vector__))) unsigned native_simd_type; > + > +union > +{ > +  native_simd_type V; > +  int R[4]; > +} store_le_vec; > + > +struct S > +{ > +  S () = default; > +  S (unsigned B0) > +  { > +    native_simd_type val{B0}; > +    m_simd = val; > +  } > +  void store_le (unsigned int out[]) > +  { > +    store_le_vec.V = m_simd; > +    unsigned int x0 = store_le_vec.R[0]; > +    memcpy (out, &x0, 4); > +  } > +  S rotl (unsigned int r) > +  { > +    native_simd_type rot{r}; > +    return __builtin_vec_rl (m_simd, rot); > +  } > +  void operator+= (S other) > +  { > +    m_simd = __builtin_vec_add (m_simd, other.m_simd); > +  } > +  void operator^= (S other) > +  { > +    m_simd = __builtin_vec_xor (m_simd, other.m_simd); > +  } > +  static void transpose (S &B0, S B1, S B2, S B3) > +  { > +    native_simd_type T0 = __builtin_vec_mergeh (B0.m_simd, B2.m_simd); > +    native_simd_type T1 = __builtin_vec_mergeh (B1.m_simd, B3.m_simd); > +    native_simd_type T2 = __builtin_vec_mergel (B0.m_simd, B2.m_simd); > +    native_simd_type T3 = __builtin_vec_mergel (B1.m_simd, B3.m_simd); > +    B0 = __builtin_vec_mergeh (T0, T1); > +    B3 = __builtin_vec_mergel (T2, T3); > +  } > +  S (native_simd_type x) : m_simd (x) {} > +  native_simd_type m_simd; > +}; > + > +void > +foo (unsigned int output[], unsigned state[]) > +{ > +  S R00 = state[0]; > +  S R01 = state[0]; > +  S R02 = state[2]; > +  S R03 = state[0]; > +  S R05 = state[5]; > +  S R06 = state[6]; > +  S R07 = state[7]; > +  S R08 = state[8]; > +  S R09 = state[9]; > +  S R10 = state[10]; > +  S R11 = state[11]; > +  S R12 = state[12]; > +  S R13 = state[13]; > +  S R14 = state[4]; > +  S R15 = state[15]; > +  for (int r = 0; r != 10; ++r) > +    { > +      R09 += R13; > +      R11 += R15; > +      R05 ^= R09; > +      R06 ^= R10; > +      R07 ^= R11; > +      R07 = R07.rotl (7); > +      R00 += R05; > +      R01 += R06; > +      R02 += R07; > +      R15 ^= R00; > +      R12 ^= R01; > +      R13 ^= R02; > +      R00 += R05; > +      R01 += R06; > +      R02 += R07; > +      R15 ^= R00; > +      R12 = R12.rotl (8); > +      R13 = R13.rotl (8); > +      R10 += R15; > +      R11 += R12; > +      R08 += R13; > +      R09 += R14; > +      R05 ^= R10; > +      R06 ^= R11; > +      R07 ^= R08; > +      R05 = R05.rotl (7); > +      R06 = R06.rotl (7); > +      R07 = R07.rotl (7); > +    } > +  R00 += state[0]; > +  S::transpose (R00, R01, R02, R03); > +  R00.store_le (output); > +} > + > +unsigned int res[1]; > +unsigned main_state[]{1634760805, 60878,      2036477234, 6, > +         
     0,      825562964,  1471091955, 1346092787, > +              506976774,  4197066702, 518848283,  118491664, > +              0,      0,          0,      0}; > +int > +main () > +{ > +  foo (res, main_state); > +  if (res[0] != 0x41fcef98) > +    __builtin_abort (); > +}