From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=lXZf=NF=linux.ibm.com=aagarwa1@sourceware.org>
Received: from mx0b-001b2d01.pphosted.com (mx0b-001b2d01.pphosted.com [148.163.158.5])
	by sourceware.org (Postfix) with ESMTPS id 0339E39730F1
	for <gcc-patches@gcc.gnu.org>; Mon,  3 Jun 2024 11:05:38 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 0339E39730F1
Authentication-Results: sourceware.org; dmarc=none (p=none dis=none) header.from=linux.ibm.com
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=linux.ibm.com
ARC-Filter: OpenARC Filter v1.0.0 sourceware.org 0339E39730F1
Authentication-Results: server2.sourceware.org; arc=none smtp.remote-ip=148.163.158.5
ARC-Seal: i=1; a=rsa-sha256; d=sourceware.org; s=key; t=1717412741; cv=none;
	b=qmQLUVZzXK0zy6zFiOrd08sL9+ZSrG0vCigLJ9VVztVAtx3YKFFSFb0Ff8MDktxd53DPJo610JiKapUtqjXw/pHa6LGhKML7hwKMm3k3JSMMZF+iVtmudBt5Eh5AJtqvKjgkosh9HwDe1bU25uAqUr7ULxXmpwpMeailHEHTFig=
ARC-Message-Signature: i=1; a=rsa-sha256; d=sourceware.org; s=key;
	t=1717412741; c=relaxed/simple;
	bh=gK9GRooV/mLEXuPGzxy06N3YPYWuCZQmgUDZDPW2fCU=;
	h=DKIM-Signature:Message-ID:Date:MIME-Version:Subject:To:From; b=LbWmPWgVVMIsZfouP22UhXr1brNk3j5o3yqWvNKdyS3IQ4C06QTtXm/tGlGjf1gOr5R62agU2DpXoOe9GL/erAgg6PNmZxxpVSirP9Qts+P38a6BpynFVAVxJ8ttwuptXkF0KrgNKg7BcJWdwDiar3PuRun1pupiKnEBIra1Vhg=
ARC-Authentication-Results: i=1; server2.sourceware.org
Received: from pps.filterd (m0353723.ppops.net [127.0.0.1])
	by mx0a-001b2d01.pphosted.com (8.17.1.19/8.17.1.19) with ESMTP id 4539Q9RS022151;
	Mon, 3 Jun 2024 11:05:35 GMT
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ibm.com;
 h=content-transfer-encoding : content-type : date : from : in-reply-to :
 message-id : mime-version : references : subject : to; s=pp1;
 bh=/qvoYZVMO8EM1Jwyuui0GxRFTI5Orrsuf6qbrwzjzEg=;
 b=O7aik752CuoWlh5DjWckZlMnMU3VBPuqB3cXEC2AbiEzg8S7Y7y489i862F6We87L+Gf
 4qXrO3lwwchIzU2uFqhtLkwHNRwUuhIJTqMtHC6lLYSdU+iSd01W3KhASL+usSaRn6xw
 WSN4IRh+fTm1Yas0+JHVa5UYQqjhdfYO1XL2x8sGdXydUqNJ8wK/DAoHcpRxGWxE8TKF
 vnDaJc5aXHRDF+zjOkrCrI0Q9k5PrC2mpZvgCf+8TtkO16I/fgt4lxwQp/rY1GKXrAGk
 0yjDkMwVoDQ6IrDYTLYBlMOXHEoAIBcY1YRT0nD0gYpofEyvj2RVSyqNdgSoN229BW+3 TQ== 
Received: from pps.reinject (localhost [127.0.0.1])
	by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3yh9w8ges6-1
	(version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT);
	Mon, 03 Jun 2024 11:05:34 +0000
Received: from m0353723.ppops.net (m0353723.ppops.net [127.0.0.1])
	by pps.reinject (8.17.1.5/8.17.1.5) with ESMTP id 453B5Yap014177;
	Mon, 3 Jun 2024 11:05:34 GMT
Received: from ppma22.wdc07v.mail.ibm.com (5c.69.3da9.ip4.static.sl-reverse.com [169.61.105.92])
	by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3yh9w8ges1-1
	(version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT);
	Mon, 03 Jun 2024 11:05:34 +0000
Received: from pps.filterd (ppma22.wdc07v.mail.ibm.com [127.0.0.1])
	by ppma22.wdc07v.mail.ibm.com (8.17.1.19/8.17.1.19) with ESMTP id 4538b947008468;
	Mon, 3 Jun 2024 11:05:33 GMT
Received: from smtprelay03.dal12v.mail.ibm.com ([172.16.1.5])
	by ppma22.wdc07v.mail.ibm.com (PPS) with ESMTPS id 3ygec0fm34-1
	(version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT);
	Mon, 03 Jun 2024 11:05:33 +0000
Received: from smtpav05.dal12v.mail.ibm.com (smtpav05.dal12v.mail.ibm.com [10.241.53.104])
	by smtprelay03.dal12v.mail.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id 453B5UYD41156968
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK);
	Mon, 3 Jun 2024 11:05:32 GMT
Received: from smtpav05.dal12v.mail.ibm.com (unknown [127.0.0.1])
	by IMSVA (Postfix) with ESMTP id 936F458147;
	Mon,  3 Jun 2024 11:05:30 +0000 (GMT)
Received: from smtpav05.dal12v.mail.ibm.com (unknown [127.0.0.1])
	by IMSVA (Postfix) with ESMTP id C603C5811A;
	Mon,  3 Jun 2024 11:05:27 +0000 (GMT)
Received: from [9.36.9.27] (unknown [9.36.9.27])
	by smtpav05.dal12v.mail.ibm.com (Postfix) with ESMTP;
	Mon,  3 Jun 2024 11:05:27 +0000 (GMT)
Message-ID: <9efb06e2-74f1-42f1-8a52-931d13a57ebc@linux.ibm.com>
Date: Mon, 3 Jun 2024 16:35:25 +0530
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Subject: Re: [Patch, rs6000, aarch64, middle-end] Add implementation for
 different targets for pair mem fusion
To: Alex Coplan <alex.coplan@arm.com>, "Kewen.Lin" <linkw@linux.ibm.com>,
        Segher Boessenkool <segher@kernel.crashing.org>,
        Michael Meissner <meissner@linux.ibm.com>,
        Peter Bergner <bergner@linux.ibm.com>,
        David Edelsohn <dje.gcc@gmail.com>,
        gcc-patches <gcc-patches@gcc.gnu.org>, richard.sandiford@arm.com
References: <53ba46de-6c01-4c68-bd98-1ba6950a793a@linux.ibm.com>
 <mpt34py71e8.fsf@arm.com>
 <cc31b3fb-552b-43bd-9029-5b84f6e6c437@linux.ibm.com>
 <mpted9i59n1.fsf@arm.com>
 <95a33b0a-8090-4218-a62c-da1f53bebbb7@linux.ibm.com>
 <mpt4jaa2zh9.fsf@arm.com>
Content-Language: en-US
From: Ajit Agarwal <aagarwa1@linux.ibm.com>
In-Reply-To: <mpt4jaa2zh9.fsf@arm.com>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
X-TM-AS-GCONF: 00
X-Proofpoint-ORIG-GUID: dZK6zxXJeRNkYxv1gFLQK7R-epgBzgTx
X-Proofpoint-GUID: bNdgieuWlZg9vTF1uy02Ly-WpLKuPFa4
X-Proofpoint-Virus-Version: vendor=baseguard
 engine=ICAP:2.0.293,Aquarius:18.0.1039,Hydra:6.0.650,FMLib:17.12.28.16
 definitions=2024-06-03_07,2024-05-30_01,2024-05-17_01
X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 priorityscore=1501
 impostorscore=0 mlxlogscore=999 phishscore=0 adultscore=0 clxscore=1015
 spamscore=0 suspectscore=0 lowpriorityscore=0 mlxscore=0 bulkscore=0
 malwarescore=0 classifier=spam adjust=0 reason=mlx scancount=1
 engine=8.12.0-2405010000 definitions=main-2406030092
X-Spam-Status: No, score=-4.8 required=5.0 tests=BAYES_00,DKIM_SIGNED,DKIM_VALID,DKIM_VALID_EF,KAM_MANYTO,KAM_SHORT,RCVD_IN_MSPIKE_H4,RCVD_IN_MSPIKE_WL,SPF_HELO_NONE,SPF_PASS,TXREP,T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org
List-Id: <gcc-patches.gcc.gnu.org>

Hello Richard:

On 03/06/24 2:07 pm, Richard Sandiford wrote:
> Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
>> Hello Richard:
>> On 31/05/24 8:08 pm, Richard Sandiford wrote:
>>> Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
>>>> On 31/05/24 3:23 pm, Richard Sandiford wrote:
>>>>> Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
>>>>>> Hello All:
>>>>>>
>>>>>> Common infrastructure using generic code for pair mem fusion of different
>>>>>> targets.
>>>>>>
>>>>>> rs6000 target specific specific code implements virtual functions defined
>>>>>> by generic code.
>>>>>>
>>>>>> Code is implemented with pure virtual functions to interface with target
>>>>>> code.
>>>>>>
>>>>>> Target specific code are added in rs6000-mem-fusion.cc and additional virtual
>>>>>> function implementation required for rs6000 are added in aarch64-ldp-fusion.cc.
>>>>>>
>>>>>> Bootstrapped and regtested for aarch64-linux-gnu and powerpc64-linux-gnu.
>>>>>>
>>>>>> Thanks & Regards
>>>>>> Ajit
>>>>>>
>>>>>>
>>>>>> aarch64, rs6000, middle-end: Add implementation for different targets for pair mem fusion
>>>>>>
>>>>>> Common infrastructure using generic code for pair mem fusion of different
>>>>>> targets.
>>>>>>
>>>>>> rs6000 target specific specific code implements virtual functions defined
>>>>>> by generic code.
>>>>>>
>>>>>> Code is implemented with pure virtual functions to interface with target
>>>>>> code.
>>>>>>
>>>>>> Target specific code are added in rs6000-mem-fusion.cc and additional virtual
>>>>>> function implementation required for rs6000 are added in aarch64-ldp-fusion.cc.
>>>>>>
>>>>>> 2024-05-31  Ajit Kumar Agarwal  <aagarwa1@linux.ibm.com>
>>>>>>
>>>>>> gcc/ChangeLog:
>>>>>>
>>>>>> 	* config/aarch64/aarch64-ldp-fusion.cc: Add target specific
>>>>>> 	implementation of additional virtual functions added in pair_fusion
>>>>>> 	struct.
>>>>>> 	* config/rs6000/rs6000-passes.def: New mem fusion pass
>>>>>> 	before pass_early_remat.
>>>>>> 	* config/rs6000/rs6000-mem-fusion.cc: Add new pass.
>>>>>> 	Add target specific implementation using pure virtual
>>>>>> 	functions.
>>>>>> 	* config.gcc: Add new object file.
>>>>>> 	* config/rs6000/rs6000-protos.h: Add new prototype for mem
>>>>>> 	fusion pass.
>>>>>> 	* config/rs6000/t-rs6000: Add new rule.
>>>>>> 	* rtl-ssa/accesses.h: Moved set_is_live_out_use as public
>>>>>> 	from private.
>>>>>>
>>>>>> gcc/testsuite/ChangeLog:
>>>>>>
>>>>>> 	* g++.target/powerpc/me-fusion.C: New test.
>>>>>> 	* g++.target/powerpc/mem-fusion-1.C: New test.
>>>>>> 	* gcc.target/powerpc/mma-builtin-1.c: Modify test.
>>>>>> ---
>>>>>
>>>>> This isn't a complete review, just some initial questions & comments
>>>>> about selected parts.
>>>>>
>>>>>> [...]
>>>>>> +/* Check whether load can be fusable or not.
>>>>>> +   Return true if dependent use is UNSPEC otherwise false.  */
>>>>>> +bool
>>>>>> +rs6000_pair_fusion::fuseable_load_p (insn_info *info)
>>>>>> +{
>>>>>> +  rtx_insn *insn = info->rtl ();
>>>>>> +
>>>>>> +  for (rtx note = REG_NOTES (insn); note; note = XEXP (note, 1))
>>>>>> +    if (REG_NOTE_KIND (note) == REG_EQUAL
>>>>>> +	|| REG_NOTE_KIND (note) == REG_EQUIV)
>>>>>> +      return false;
>>>>>
>>>>> It's unusual to punt on an optimisation because of a REG_EQUAL/EQUIV
>>>>> note.  What's the reason for doing this?  Are you trying to avoid
>>>>> fusing pairs before reload that are equivalent to a MEM (i.e. have
>>>>> a natural spill slot)?  I think Alex hit a similar situation.
>>>>>
>>>>
>>>> We have used the above check because of some SPEC benchmarks failing with
>>>> with MEM pairs having REG_EQUAL/EQUIV notes.
>>>>
>>>> By adding the checks the benchmarks passes and also it improves the
>>>> performance.
>>>>
>>>> This checks were added during initial implementation of pair fusion
>>>> pass.
>>>>
>>>> I will investigate further if this check is still required or not.
>>>
>>> Thanks.  If it does affect SPEC results, it would be good to look
>>> at the underlying reason, as a justification for the check.
>>>
>>> AIUI, the case Alex was due to the way that the RA recognises:
>>>
>>>   (set (reg R) (mem address-of-a-stack-variable))
>>>     REG_EQUIV: (mem address-of-a-stack-variable)
>>>
>>> where the REG_EQUIV is either explicit or detected by the RA.
>>> If R needs to be spilled, it can then be spilled to its existing
>>> location on the stack.  And if R needs to be spilled in the
>>> instruction above (because of register pressure before the first
>>> use of R), the RA is able to delete the instruction.
>>>
>>> But if that is the reason, the condition should be restricted
>>> to cases in which the note is a memory.
>>>
>>> I think Alex had tried something similar and found that it wasn't
>>> always effective.
>>>
>>
>> Sure I will check.
>>>> [...]
>>>>>> +		   && info->is_real ())
>>>>>> +		  {
>>>>>> +		    rtx_insn *rtl_insn = info->rtl ();
>>>>>> +		    rtx set = single_set (rtl_insn);
>>>>>> +
>>>>>> +		    if (set == NULL_RTX)
>>>>>> +		      return false;
>>>>>> +
>>>>>> +		    rtx op0 = SET_SRC (set);
>>>>>> +		    if (GET_CODE (op0) != UNSPEC)
>>>>>> +		      return false;
>>>>>
>>>>> What's the motivation for rejecting unspecs?  It's unsual to treat
>>>>> all unspecs as a distinct group.
>>>>>
>>>>> Also, using single_set means that the function still lets through
>>>>> parallels of two sets in which the sources are unspecs.  Is that
>>>>> intentional?
>>>>>
>>>>> The reasons behind things like the REG_EQUAL/EQUIV and UNSPEC decisions
>>>>> need to be described in comments, so that other people coming to this
>>>>> code later can understand the motivation.  The same thing applies to
>>>>> other decisions in the patch.
>>>>>
>>>>
>>>> Adjacent load pair fusion with 256 bit OOmode is seen and valid with use of load
>>>> in UNSPEC. Thats why this check is added.
>>>
>>> Can you give an example (in terms of rtl)?
>>
>>
>> (insn 23 72 25 4 (set (reg:OO 124 [ vect__1.24 ])
>>         (mem:OO (reg:DI 118 [ ivtmp.73 ]) [0 MEM <vector(16) unsigned char> [(unsigned char *)_65]+0 S16 A8]))  {*movoo}
>>      (nil))
>> (insn 25 23 103 4 (set (reg:OO 174 [ vect__1.26 ])
>>         (mem:OO (plus:DI (reg:DI 118 [ ivtmp.73 ])
>>                 (const_int 32 [0x20])) [0 MEM <vector(16) unsigned char> [(unsigned char *)_65 + 32B]+0 S16 A8]))  {*movoo}
>>      (nil))
>> (insn 103 25 27 4 (set (reg:DI 118 [ ivtmp.73 ])
>>         (plus:DI (reg:DI 118 [ ivtmp.73 ])
>>             (const_int 64 [0x40]))) 66 {*adddi3}
>>      (nil))
>> (insn 27 103 29 4 (set (reg:V16QI 176 [ vect_perm_even_123 ])
>>         (unspec:V16QI [
>>                 (subreg:V16QI (reg:OO 124 [ vect__1.24 ]) 0)
>>                 (subreg:V16QI (reg:OO 124 [ vect__1.24 ]) 16)
>>                 (reg:V16QI 300)
>>             ] UNSPEC_VPERM))  {altivec_vperm_v16qi_direct}
>>      (nil))
>>
>> This is the example where UNSPEC is used with 256 bit OOmode.
> 
> But what (conceptually) makes the vperm instruction interesting?
> I.e. what is it about the instruction that makes you want to look for
> it?
> 
The example above used UNSPEC with vperm but this patch handles
all the variants of UNSPEC instructions.

> Is it that the instruction uses both halves of the fused load?  If so,
> I don't think the test in the patch achieves that.  ISTM that the patch
> checks both loads individually, and so would also match cases where one
> load is used by one vperm and the other load is used by a different vperm.
> Is that the intention?
>

Yes the patch doesn't assumes that UNSPEC instruction uses both
halves of load. The patch checks both load individually and can
handles where one load used by one vperm and other load is
used by different vperm.

 
> It would also match cases where the vperm inputs are the other around,
> so that the first operand is loaded from a higher address than the
> second operand.  Is that ok, or is it a problem?
>

Yes the patch can handle this case also.


> Similarly, there are other instructions that have unspecs, such as:
> 
> (define_insn "*insert4b_internal"
>   [(set (match_operand:V16QI 0 "vsx_register_operand" "=wa")
> 	(unspec:V16QI [(match_operand:V4SI 1 "vsx_register_operand" "wa")
> 		       (match_operand:V16QI 2 "vsx_register_operand" "0")
> 		       (match_operand:QI 3 "const_0_to_12_operand" "n")]
> 		   UNSPEC_XXINSERTW))]
> 
> (to pick an arbitrary example).  It looks like the patch would also
> accept uses by this instruction.  Is that intentional?
>

Yes patch can accept above UNSPEC.
The patch can handle all variants of UNSPEC instructions.

 
> If it is intentional, what distinguishes things like vperm and xxinsertw
> (and all other unspecs) from plain addition?
> 
>   [(set (match_operand:VSX_F 0 "vsx_register_operand" "=wa")
>         (plus:VSX_F (match_operand:VSX_F 1 "vsx_register_operand" "wa")
> 		    (match_operand:VSX_F 2 "vsx_register_operand" "wa")))]
>

Plain addition are not supported currently.
We have not seen many cases with plain addition and this patch
will not accept plain addition.

 
> This is why the intention behind the patch is important.  As it stands,
> it isn't clear what criteria the patch is using to distinguish "valid"
> fuse candidates from "invalid" ones.
>

Intention behind this patch all variants of UNSPEC instructions are
supported and uses without UNSPEC are not supported in this patch.

 
>>>>>> [...]
>>>>>> +  // Given insn_info pair I1 and I2, return true if offsets are in order.
>>>>>> +  virtual bool should_handle_unordered_insns (rtl_ssa::insn_info *i1,
>>>>>> +					      rtl_ssa::insn_info *i2) = 0;
>>>>>> +
>>>>>
>>>>> This name seems a bit misleading.  The function is used in:
>>>>>
>>>>> @@ -2401,6 +2405,9 @@ pair_fusion_bb_info::try_fuse_pair (bool load_p, unsigned access_size,
>>>>>        reversed = true;
>>>>>      }
>>>>>  
>>>>> +  if (!m_pass->should_handle_unordered_insns (i1, i2))
>>>>> +    return false;
>>>>> +
>>>>>    rtx cand_mems[2];
>>>>>    rtx reg_ops[2];
>>>>>    rtx pats[2];
>>>>>
>>>>> and so it acts as a general opt-out.  The insns aren't known to be unordered.
>>>>>
>>>>> It looks like the rs6000 override requires the original insns to be
>>>>> in offset order.  Could you say why that's necessary?  (Both in email
>>>>> and as a comment in the code.)
>>>>>
>>>>
>>>> Yes rs6000 requires the original load insns to be in offset order.
>>>> Some regression tests like vect-outer-4f fails if we do load pair
>>>> fusion with load offsets are not in offset order as this breaks lxvp 
>>>> semantics.
>>>
>>> How does it break the semantics though?  In principle, the generic code
>>> only fuses if it has "proved" that the loads can happen in either order.
>>> So it shouldn't matter which order the hardware does things in.
>>>
>>> Could you give an example of the kind of situation that you want
>>> to avoid, and why it generates the wrong result?
>>>
>>
>> (insn 31 62 32 2 (set (reg:V16QI 177 [ MEM <vector(8) short unsigned int> [(short unsigned int *)vectp.62_36 + 64B] ])
>>         (mem:V16QI (plus:DI (reg/f:DI 121 [ vectp.62 ])
>>                 (const_int 64 [0x40])) [1 MEM <vector(8) short unsigned int> [(short unsigned int *)vectp.62_36 + 64B]+0 S16 A16]))  {vsx_movv16qi_64bit}
>>      (nil))
>> (insn 32 31 16 2 (set (reg:V16QI 178 [ MEM <vector(8) short unsigned int> [(short unsigned int *)vectp.62_36 + 80B] ])
>>         (mem:V16QI (plus:DI (reg/f:DI 121 [ vectp.62 ])
>>                 (const_int 80 [0x50])) [1 MEM <vector(8) short unsigned int> [(short unsigned int *)vectp.62_36 + 80B]+0 S16 A16]))  {vsx_movv16qi_64bit}
>>      (nil))
>> (insn 16 32 21 2 (set (reg:V16QI 159 [ MEM <vector(8) short unsigned int> [(short unsigned int *)vectp.62_36 + 16B] ])
>>         (mem:V16QI (plus:DI (reg/f:DI 121 [ vectp.62 ])
>>                 (const_int 16 [0x10])) [1 MEM <vector(8) short unsigned int> [(short unsigned int *)vectp.62_36 + 16B]+0 S16 A16]))  {vsx_movv16qi_64bit}
>>      (nil))
>> (insn 21 16 22 2 (set (reg:V16QI 165 [ MEM <vector(8) short unsigned int> [(short unsigned int *)vectp.62_36 + 32B] ])
>>         (mem:V16QI (plus:DI (reg/f:DI 121 [ vectp.62 ])
>>                 (const_int 32 [0x20])) [1 MEM <vector(8) short unsigned int> [(short unsigned int *)vectp.62_36 + 32B]+0 S16 A16])) {vsx_movv16qi_64bit}
>>      (nil))
>> (insn 22 21 37 2 (set (reg:V16QI 166 [ MEM <vector(8) short unsigned int> [(short unsigned int *)vectp.62_36 + 48B] ])
>>         (mem:V16QI (plus:DI (reg/f:DI 121 [ vectp.62 ])
>>                 (const_int 48 [0x30])) [1 MEM <vector(8) short unsigned int> [(short unsigned int *)vectp.62_36 + 48B]+0 S16 A16])) {vsx_movv16qi_64bit}
>>      (nil))
>>
>> insn 22 and insn 31 is merged in the failure case and breaks the code.
> 
> What specifically goes wrong though?  This is just a sequence of loads
> from the same base pointer, with no interdependencies, so it should be
> possible to do the loads in any order.
>

Here in fuse_pair we set first and second based  as follows:

  insn_info *first = (*i1 < *i2) ? i1 : i2;
  insn_info *second = (first == i1) ? i2 : i1;

This makes higher offset with first and lower offset with second.
if (*i1 > *i2).

and in set_multiword_subreg interface we pass first and second.
Hence above code breaks because subreg offsets with 256 bits are not set properly.

If we pass i1 and i2 in set_multiword_subreg (i1, i2, load_p)
in fuse_pair should_handle_unordered_insns is not required in try_fuse_pair.

I will send the patch by removing the above interface check 
in try_fuse_pair and pass i1 and i2 in set_multiword_subreg

This works fine.

Sorry for the inconvenience caused.
 
> Thanks,
> Richard

Thanks & Regards
Ajit