From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=K12V=HM=linux.ibm.com=aagarwa1@sourceware.org>
Received: from mx0b-001b2d01.pphosted.com (mx0b-001b2d01.pphosted.com [148.163.158.5])
	by sourceware.org (Postfix) with ESMTPS id DAD5B385C335
	for <gcc-patches@gcc.gnu.org>; Fri,  1 Dec 2023 09:10:20 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org DAD5B385C335
Authentication-Results: sourceware.org; dmarc=none (p=none dis=none) header.from=linux.ibm.com
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=linux.ibm.com
ARC-Filter: OpenARC Filter v1.0.0 sourceware.org DAD5B385C335
Authentication-Results: server2.sourceware.org; arc=none smtp.remote-ip=148.163.158.5
ARC-Seal: i=1; a=rsa-sha256; d=sourceware.org; s=key; t=1701421825; cv=none;
	b=E3aNLjGS4QfnKtCTCb+vI3PaV3ag4cCE5dsO3VwiThVgPFLUYYkmVUGbe6Q9U0dO0QzcZinGdhj/iGb6gKTxZACijgQLMK3pbSu4llgF8VteFMH+Lk5ooIL/iLlia8w8dOiFDh+Ui9+ecBaShUL6+TK7nPqw2ZdcJb9WSoEtN1g=
ARC-Message-Signature: i=1; a=rsa-sha256; d=sourceware.org; s=key;
	t=1701421825; c=relaxed/simple;
	bh=WHXQ/8mXlaq9Au7IPwN+xdY7lvlgPjBBH4pPsVeOlRw=;
	h=DKIM-Signature:Message-ID:Date:Subject:To:From:MIME-Version; b=tlnTeq2A5cExM57mCX7/pmKo9ON+2hviPBd4EmJo/Tg/kxGzFQuVdv1QLY4V5OsPDRL9Na4u4OVx8KtW8S487E9NC+cYPBHoJiJG8DJ1eWCM/xg1t6SiuKBltZW5Zc/dkdVP1o+lXhIJDHwRGRAdFnNad5Z53W0078RhlryZMRQ=
ARC-Authentication-Results: i=1; server2.sourceware.org
Received: from pps.filterd (m0360072.ppops.net [127.0.0.1])
	by mx0a-001b2d01.pphosted.com (8.17.1.19/8.17.1.19) with ESMTP id 3B197lb9006349;
	Fri, 1 Dec 2023 09:10:20 GMT
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ibm.com; h=message-id : date :
 subject : to : cc : references : from : in-reply-to : content-type :
 content-transfer-encoding : mime-version; s=pp1;
 bh=b4iwHtrd5wGHTHoTa7XXRaslopAcXJq/2+QsTio8Rzw=;
 b=fJ1f+icFMW3kleVqa5B8WCb1skFnbAQ04k4P+M170zw3pXVQ5ifNU0Og+K/50ycAiP+b
 cbxE2PNa8GjXcLqN3Zv84uRxrmGccfkfIeVOLHlXXjbgSJX6JZlOTZR0nekg66Pfp3uj
 de/7sGPXbQdbxQLBoEqS2ETnceZqZA24hkazIUD6XrTg8gVom/0ugMVR1gi4qLqVmcxj
 zrTu3yKfoDCoC5tvNEbfPtdqoWCJ1M3nELsJy6Svwfkhn//cAfbpUNU5QCzkM1Y1tmZi
 9uL6qDaa2r1HYDNhJVZQ9lWXmLoO3GHXOd1InSoD8N9xhedgqBLUakOrm8TTO8XBSO8s 1g== 
Received: from pps.reinject (localhost [127.0.0.1])
	by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3uqck5g2rf-1
	(version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT);
	Fri, 01 Dec 2023 09:10:19 +0000
Received: from m0360072.ppops.net (m0360072.ppops.net [127.0.0.1])
	by pps.reinject (8.17.1.5/8.17.1.5) with ESMTP id 3B1989kV007299;
	Fri, 1 Dec 2023 09:10:19 GMT
Received: from ppma21.wdc07v.mail.ibm.com (5b.69.3da9.ip4.static.sl-reverse.com [169.61.105.91])
	by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3uqck5g2r1-1
	(version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT);
	Fri, 01 Dec 2023 09:10:19 +0000
Received: from pps.filterd (ppma21.wdc07v.mail.ibm.com [127.0.0.1])
	by ppma21.wdc07v.mail.ibm.com (8.17.1.19/8.17.1.19) with ESMTP id 3B194ShO031293;
	Fri, 1 Dec 2023 09:10:18 GMT
Received: from smtprelay07.dal12v.mail.ibm.com ([172.16.1.9])
	by ppma21.wdc07v.mail.ibm.com (PPS) with ESMTPS id 3ukv8p41ru-1
	(version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT);
	Fri, 01 Dec 2023 09:10:18 +0000
Received: from smtpav03.dal12v.mail.ibm.com (smtpav03.dal12v.mail.ibm.com [10.241.53.102])
	by smtprelay07.dal12v.mail.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id 3B19AHVT50135698
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK);
	Fri, 1 Dec 2023 09:10:17 GMT
Received: from smtpav03.dal12v.mail.ibm.com (unknown [127.0.0.1])
	by IMSVA (Postfix) with ESMTP id B990758056;
	Fri,  1 Dec 2023 09:10:17 +0000 (GMT)
Received: from smtpav03.dal12v.mail.ibm.com (unknown [127.0.0.1])
	by IMSVA (Postfix) with ESMTP id E89AA5805A;
	Fri,  1 Dec 2023 09:10:14 +0000 (GMT)
Received: from [9.43.32.20] (unknown [9.43.32.20])
	by smtpav03.dal12v.mail.ibm.com (Postfix) with ESMTP;
	Fri,  1 Dec 2023 09:10:14 +0000 (GMT)
Message-ID: <d262c40e-428b-4132-9dd1-0fa848dafcdc@linux.ibm.com>
Date: Fri, 1 Dec 2023 14:40:13 +0530
User-Agent: Mozilla Thunderbird
Subject: Re: [PATCH v2] rs6000: Add new pass for replacement of contiguous
 addresses vector load lxv with lxvp
Content-Language: en-US
To: "Kewen.Lin" <linkw@linux.ibm.com>, GCC Patches <gcc-patches@gcc.gnu.org>
Cc: Segher Boessenkool <segher@kernel.crashing.org>,
        David Edelsohn <dje.gcc@gmail.com>,
        Peter Bergner <bergner@linux.ibm.com>,
        Michael Meissner <meissner@linux.ibm.com>
References: <ef0c54a5-c35c-3519-f062-9ac78ee66b81@linux.ibm.com>
 <77426697-1571-e180-add9-cfb6d10f8424@linux.ibm.com>
From: Ajit Agarwal <aagarwa1@linux.ibm.com>
In-Reply-To: <77426697-1571-e180-add9-cfb6d10f8424@linux.ibm.com>
Content-Type: text/plain; charset=UTF-8
X-TM-AS-GCONF: 00
X-Proofpoint-GUID: viZvQIMRpBG40Y5x8Huuw9eg4WHIZGHD
X-Proofpoint-ORIG-GUID: iYRQL8vY3tL2jw3ZFHSA1j2pZaSffizR
Content-Transfer-Encoding: 7bit
X-Proofpoint-UnRewURL: 0 URL was un-rewritten
MIME-Version: 1.0
X-Proofpoint-Virus-Version: vendor=baseguard
 engine=ICAP:2.0.272,Aquarius:18.0.997,Hydra:6.0.619,FMLib:17.11.176.26
 definitions=2023-12-01_06,2023-11-30_01,2023-05-22_02
X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 bulkscore=0 phishscore=0
 mlxlogscore=999 priorityscore=1501 lowpriorityscore=0 malwarescore=0
 suspectscore=0 spamscore=0 adultscore=0 mlxscore=0 impostorscore=0
 clxscore=1015 classifier=spam adjust=0 reason=mlx scancount=1
 engine=8.12.0-2311060000 definitions=main-2312010060
X-Spam-Status: No, score=-12.8 required=5.0 tests=BAYES_00,DKIM_SIGNED,DKIM_VALID,DKIM_VALID_EF,GIT_PATCH_0,KAM_SHORT,RCVD_IN_MSPIKE_H4,RCVD_IN_MSPIKE_WL,SPF_HELO_NONE,SPF_PASS,TXREP,T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org
List-Id: <gcc-patches.gcc.gnu.org>

Hello Kewen:

On 24/11/23 3:01 pm, Kewen.Lin wrote:
> Hi Ajit,
> 
> Don't forget to CC David (CC-ed) :), some comments are inlined below.
> 
> on 2023/10/8 03:04, Ajit Agarwal wrote:
>> Hello All:
>>
>> This patch add new pass to replace contiguous addresses vector load lxv with mma instruction
>> lxvp.
> 
> IMHO the current binding lxvp (and lxvpx, stxvp{x,}) to MMA looks wrong, it's only
> Power10 and VSX required, these instructions should perform well without MMA support.
> So one patch to separate their support from MMA seems to go first.
> 

I will make the changes for Power10 and VSX.

>> This patch addresses one regressions failure in ARM architecture.
> 
> Could you explain this?  I don't see any test case for this.

I have submitted v1 of the patch and there were regressions failure for Linaro.
I have fixed in version V2.
> 
>> Bootstrapped and regtested with powepc64-linux-gnu.
>>
>> Thanks & Regards
>> Ajit
>>
>>
>> rs6000: Add new pass for replacement of contiguous lxv with lxvp.
>>
>> New pass to replace contiguous addresses lxv with lxvp. This pass
>> is registered after ree rtl pass.> 
>> 2023-10-07  Ajit Kumar Agarwal  <aagarwa1@linux.ibm.com>
>>
>> gcc/ChangeLog:
>>
>> 	* config/rs6000/rs6000-passes.def: Registered vecload pass.
>> 	* config/rs6000/rs6000-vecload-opt.cc: Add new pass.
>> 	* config.gcc: Add new executable.
>> 	* config/rs6000/rs6000-protos.h: Add new prototype for vecload
>> 	pass.
>> 	* config/rs6000/rs6000.cc: Add new prototype for vecload pass.
>> 	* config/rs6000/t-rs6000: Add new rule.
>>
>> gcc/testsuite/ChangeLog:
>>
>> 	* g++.target/powerpc/vecload.C: New test.
>> ---
>>  gcc/config.gcc                             |   4 +-
>>  gcc/config/rs6000/rs6000-passes.def        |   1 +
>>  gcc/config/rs6000/rs6000-protos.h          |   2 +
>>  gcc/config/rs6000/rs6000-vecload-opt.cc    | 234 +++++++++++++++++++++
>>  gcc/config/rs6000/rs6000.cc                |   3 +-
>>  gcc/config/rs6000/t-rs6000                 |   4 +
>>  gcc/testsuite/g++.target/powerpc/vecload.C |  15 ++
>>  7 files changed, 260 insertions(+), 3 deletions(-)
>>  create mode 100644 gcc/config/rs6000/rs6000-vecload-opt.cc
>>  create mode 100644 gcc/testsuite/g++.target/powerpc/vecload.C
>>
>> diff --git a/gcc/config.gcc b/gcc/config.gcc
>> index ee46d96bf62..482ab094b89 100644
>> --- a/gcc/config.gcc
>> +++ b/gcc/config.gcc
>> @@ -515,7 +515,7 @@ or1k*-*-*)
>>  	;;
>>  powerpc*-*-*)
>>  	cpu_type=rs6000
>> -	extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-logue.o"
>> +	extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-logue.o rs6000-vecload-opt.o"
>>  	extra_objs="${extra_objs} rs6000-call.o rs6000-pcrel-opt.o"
>>  	extra_objs="${extra_objs} rs6000-builtins.o rs6000-builtin.o"
>>  	extra_headers="ppc-asm.h altivec.h htmintrin.h htmxlintrin.h"
>> @@ -552,7 +552,7 @@ riscv*)
>>  	;;
>>  rs6000*-*-*)
>>  	extra_options="${extra_options} g.opt fused-madd.opt rs6000/rs6000-tables.opt"
>> -	extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-logue.o"
>> +	extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-logue.o rs6000-vecload-opt.o"
>>  	extra_objs="${extra_objs} rs6000-call.o rs6000-pcrel-opt.o"
>>  	target_gtfiles="$target_gtfiles \$(srcdir)/config/rs6000/rs6000-logue.cc \$(srcdir)/config/rs6000/rs6000-call.cc"
>>  	target_gtfiles="$target_gtfiles \$(srcdir)/config/rs6000/rs6000-pcrel-opt.cc"
>> diff --git a/gcc/config/rs6000/rs6000-passes.def b/gcc/config/rs6000/rs6000-passes.def
>> index ca899d5f7af..9ecf8ce6a9c 100644
>> --- a/gcc/config/rs6000/rs6000-passes.def
>> +++ b/gcc/config/rs6000/rs6000-passes.def
>> @@ -28,6 +28,7 @@ along with GCC; see the file COPYING3.  If not see
>>       The power8 does not have instructions that automaticaly do the byte swaps
>>       for loads and stores.  */
>>    INSERT_PASS_BEFORE (pass_cse, 1, pass_analyze_swaps);
>> +  INSERT_PASS_AFTER (pass_ree, 1, pass_analyze_vecload);
>>  
>>    /* Pass to do the PCREL_OPT optimization that combines the load of an
>>       external symbol's address along with a single load or store using that
>> diff --git a/gcc/config/rs6000/rs6000-protos.h b/gcc/config/rs6000/rs6000-protos.h
>> index f70118ea40f..9c44bae33d3 100644
>> --- a/gcc/config/rs6000/rs6000-protos.h
>> +++ b/gcc/config/rs6000/rs6000-protos.h
>> @@ -91,6 +91,7 @@ extern int mems_ok_for_quad_peep (rtx, rtx);
>>  extern bool gpr_or_gpr_p (rtx, rtx);
>>  extern bool direct_move_p (rtx, rtx);
>>  extern bool quad_address_p (rtx, machine_mode, bool);
>> +extern bool mode_supports_dq_form (machine_mode);
>>  extern bool quad_load_store_p (rtx, rtx);
>>  extern bool fusion_gpr_load_p (rtx, rtx, rtx, rtx);
>>  extern void expand_fusion_gpr_load (rtx *);
>> @@ -344,6 +345,7 @@ class rtl_opt_pass;
>>  
>>  extern rtl_opt_pass *make_pass_analyze_swaps (gcc::context *);
>>  extern rtl_opt_pass *make_pass_pcrel_opt (gcc::context *);
>> +extern rtl_opt_pass *make_pass_analyze_vecload (gcc::context *);
>>  extern bool rs6000_sum_of_two_registers_p (const_rtx expr);
>>  extern bool rs6000_quadword_masked_address_p (const_rtx exp);
>>  extern rtx rs6000_gen_lvx (enum machine_mode, rtx, rtx);
>> diff --git a/gcc/config/rs6000/rs6000-vecload-opt.cc b/gcc/config/rs6000/rs6000-vecload-opt.cc
>> new file mode 100644
>> index 00000000000..63ee733af89
>> --- /dev/null
>> +++ b/gcc/config/rs6000/rs6000-vecload-opt.cc
>> @@ -0,0 +1,234 @@
>> +/* Subroutines used to replace lxv with lxvp
>> +   for p10 little-endian VSX code.
>> +   Copyright (C) 2020-2023 Free Software Foundation, Inc.
>> +   Contributed by Ajit Kumar Agarwal <aagarwa1@linux.ibm.com>.
>> +
> 
> Some comments here describing how is this implemented are appreciated,
> it helps code readers in future and also the reviewers.  :)
> 
>> +   This file is part of GCC.
>> +
>> +   GCC is free software; you can redistribute it and/or modify it
>> +   under the terms of the GNU General Public License as published
>> +   by the Free Software Foundation; either version 3, or (at your
>> +   option) any later version.
>> +
>> +   GCC is distributed in the hope that it will be useful, but WITHOUT
>> +   ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
>> +   or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public
>> +   License for more details.
>> +
>> +   You should have received a copy of the GNU General Public License
>> +   along with GCC; see the file COPYING3.  If not see
>> +   <http://www.gnu.org/licenses/>.  */
>> +
>> +#define IN_TARGET_CODE 1
>> +
>> +#include "config.h"
>> +#include "system.h"
>> +#include "coretypes.h"
>> +#include "backend.h"
>> +#include "rtl.h"
>> +#include "tree.h"
>> +#include "memmodel.h"
>> +#include "df.h"
>> +#include "tm_p.h"
>> +#include "ira.h"
>> +#include "print-tree.h"
>> +#include "varasm.h"
>> +#include "explow.h"
>> +#include "expr.h"
>> +#include "output.h"
>> +#include "tree-pass.h"
>> +#include "regs.h"
>> +#include "rtx-vector-builder.h"
>> +#include "rs6000-protos.h"
> 
> Nit: Some included header looks useless, at least the rtx-vector-builder.h,
> could you have a check and leave those necessary only.
> 

I will make the change accordingly.
>> +
>> +static inline bool
>> +quad_address_offset_p (HOST_WIDE_INT offset)
>> +{
>> +  return (IN_RANGE (offset, -32768, 32767) && ((offset) & 0xf) == 0);
>> +}
> 
> This same implementation is in rs6000-internal.h, so just include it?
> 
>> +
>> +/* Replace identified lxv with lxvp.  */
>> +static void
>> +replace_lxv_with_lxvp (rtx_insn *insn1, rtx_insn *insn2)
>> +{
>> +  rtx body = PATTERN (insn1);
>> +  rtx src_exp = SET_SRC (body);
>> +  rtx dest_exp = SET_DEST (body);
>> +  rtx lxv;
>> +  rtx insn2_body = PATTERN (insn2);
>> +  rtx insn2_dest_exp = SET_DEST (insn2_body);
>> +  unsigned int regno = REGNO (dest_exp);
> 
> Nit: It seems more clear using body{1,2}, src{1,2}, dest{1,2} etc.
> 

I will make the changes.

>> +
>> +  if (regno > REGNO (insn2_dest_exp))
>> +    {
>> +      df_set_regs_ever_live (REGNO (dest_exp), false);
>> +      df_set_regs_ever_live (REGNO (insn2_dest_exp), true);
>> +      SET_REGNO (dest_exp, REGNO (insn2_dest_exp));
>> +      dest_exp->used = 1;
>> +      df_set_regs_ever_live (REGNO (insn2_dest_exp), false);
>> +      df_set_regs_ever_live (regno, true);
>> +      SET_REGNO (insn2_dest_exp, regno);
>> +      insn2_dest_exp->used = 1;
>> +    }
> 
> Why do we need this change?  I think a descriptive comment is appreciated
> for the above hunk.
> 

I will add descriptive comment.

>> +  rtx opnd = gen_rtx_REG (OOmode, REGNO (dest_exp));
>> +  PUT_MODE (src_exp, OOmode);
>> +  lxv = gen_movoo (opnd, src_exp);
>> +  rtx_insn *new_insn = emit_insn_before (lxv,  insn1);
>> +  set_block_for_insn (new_insn, BLOCK_FOR_INSN (insn1));
>> +  df_insn_rescan (new_insn);
>> +  
>> +  if (dump_file)
>> +    {
>> +      unsigned int new_uid = INSN_UID (new_insn);
>> +      fprintf (dump_file, "Replacing lxv %d with lxvp  %d\n",
>> +			  INSN_UID (insn1), new_uid);
>> +    }
>> +  df_insn_delete (insn1);
>> +  remove_insn (insn1);
>> +  df_insn_delete (insn2);
>> +  remove_insn (insn2);
>> +  insn1->set_deleted ();
>> +  insn2->set_deleted ();
>> +}
>> +
>> +/* Identify lxv instruction that are candidate of continguous
>> +   addresses and replace them with mma instruction lxvp.  */
>> +unsigned int
>> +rs6000_analyze_vecload (function *fun)
>> +{
>> +  basic_block bb;
>> +  rtx_insn *insn, *curr_insn = 0;
>> +  rtx_insn *insn1 = 0, *insn2 = 0;
>> +  bool first_vec_insn = false;
>> +  unsigned int offset = 0;
>> +  unsigned int regno = 0;
>> +
>> +  FOR_ALL_BB_FN (bb, fun)
>> +    FOR_BB_INSNS_SAFE (bb, insn, curr_insn)
>> +    {
>> +      if (NONDEBUG_INSN_P (insn) && GET_CODE (PATTERN (insn)) == SET)
>> +	{
>> +	  rtx set = single_set (insn);
>> +	  rtx src = SET_SRC (set);
>> +	  machine_mode mode = GET_MODE (SET_DEST (set));
>> +	  bool dest_fp_p, dest_vmx_p, dest_vsx_p = false;
>> +	  rtx dest = SET_DEST (PATTERN (insn));
>> +	  int dest_regno;
>> +
>> +	  if (REG_P (dest))
>> +	    {
>> +	      dest_regno = REGNO (dest);
>> +	      dest_fp_p = FP_REGNO_P (dest_regno);
>> +	      dest_vmx_p = ALTIVEC_REGNO_P (dest_regno);
>> +	      dest_vsx_p = dest_fp_p | dest_vmx_p;
> 
> Nit: it looks dest_vsx_p is enough, don't need separate bool for
> dest_vmx_p and dest_fp_p.
> 
>> +	    }
>> +	   else
>> +	     {
>> +	       dest_regno = -1;
>> +	       dest_fp_p = dest_vmx_p = dest_vsx_p = false;
> 
> Just "continue;" for !REG_P (dest) as the below code only takes
> care of "dest_regno >= 0".
> 
>> +	     }
>> +
>> +	   if (TARGET_VSX && TARGET_MMA && dest_vsx_p)
> 
> Nit: Duplicated condition, TARGET_MMA requires TARGET_VSX.
> 
>> +	     {
>> +	       if (mode_supports_dq_form (mode)
>> +		   && dest_regno >= 0 && MEM_P (src)
> 
> dest_regno >= 0 becomes useless if we continue early;
> 
>> +		   && quad_address_p (XEXP (src, 0), mode, true))
>> +		  {
>> +		    if (first_vec_insn)
>> +		      {
>> +			rtx addr = XEXP (src, 0);
>> +			insn2 = insn;
>> +
>> +			if (GET_CODE (addr) != PLUS)
>> +			  return false;
> 
> This looks unexpected to me, when we encounter one possible candidate as
> first vec insn, it can be still unrelated to the others, if we directly
> return early, we can miss some candidates behind.
>

I will make the changes accordingly.
 
> This is also applied for other "return false"s below.
> 
>> +
>> +			rtx op0 = XEXP (addr, 0);
>> +			if (!REG_P (op0) || !INT_REG_OK_FOR_BASE_P (op0, true))
>> +			  return false;
>> +
>> +			rtx op1 = XEXP (addr, 1);
>> +			if (!CONST_INT_P (op1))
>> +			  return false;
>> +
>> +			mem_attrs attrs (*get_mem_attrs (src));
>> +			bool reg_attrs_found = false;
>> +
>> +			if (REG_P (dest) && REG_ATTRS (dest))
>> +			  {
>> +			    poly_int64 off = REG_ATTRS (dest)->offset;
>> +			    if (known_ge (off, 0))
>> +			      reg_attrs_found = true;
>> +			  }
>> +			if ((attrs.offset_known_p && known_ge (attrs.offset, 0))
>> +			     && reg_attrs_found
>> +			     &&  quad_address_offset_p (INTVAL (op1))
>> +			     && (regno == REGNO (op0))
>> +			     && ((INTVAL (op1) - offset) == 16))
>> +			   {
>> +			     replace_lxv_with_lxvp (insn1, insn2);
>> +			     return true;
> 
> This function is directly used for return in pass_analyze_vecload::execute,
> the return value of pass...::execute is for TODO flags, "return true" is
> appending flag to it.  If you want to record some information like if this
> function changes something actually, you can add one variable "bool changed"
> in pass_analyze_vecload::execute and keep the return value as 0 there.
> 

I will make the changes accordingly.

>> +			   }
> 
> Here it seems to me that this handlings only takes care of the first indirect
> addressing and the second DQ-form addressing, but we should be able to handle
> two DQ-form or two X-form as well if we know they are adjacent as expected.
>

I will make the change.

 
> Besides, it seems a bad idea to put this pass after reload? as register allocation
> finishes, this pairing has to be restricted by the reg No. (I didn't see any
> checking on the reg No. relationship for paring btw.)
>

Adding before reload pass deletes one of the lxv and replaced with lxvp. This
fails in reload pass while freeing reg_eqivs as ira populates them and then
vecload pass deletes some of insns and while freeing in reload pass as insn
is already deleted in vecload pass reload pass segfaults.

Moving vecload pass before ira will not make register pairs with lxvp and
in ira and that will be a problem.

Making after reload pass is the only solution I see as ira and reload pass
makes register pairs and vecload pass will be easier with generation of
lxvp.

Please suggest.
 
> Looking forward to the comments from Segher/David/Peter/Mike etc.
> 
>> +			}
>> +			if (REG_P (XEXP (src, 0))
> 
> Formatting issue, wrong indentation for this "if" hunk.
> 
>> +			    && GET_CODE (XEXP (src, 0)) != PLUS)
>> +			  {
>> +			    mem_attrs attrs (*get_mem_attrs (src));
>> +			    if (attrs.offset_known_p)
>> +			      offset = attrs.offset;
>> +			    if (offset == 0 && REG_P (dest) && REG_ATTRS (dest))
>> +			      offset = REG_ATTRS (dest)->offset;
>> +			    regno = REGNO (XEXP (src,0));
>> +			    first_vec_insn = true;
>> +			    insn1 = insn;
>> +			 }
>> +		     }
>> +	       }
>> +	}
>> +    }
>> +  return false;
>> +}
>> +
>> +const pass_data pass_data_analyze_vecload =
>> +{
>> +  RTL_PASS, /* type */
>> +  "vecload", /* name */
>> +  OPTGROUP_NONE, /* optinfo_flags */
>> +  TV_NONE, /* tv_id */
>> +  0, /* properties_required */
>> +  0, /* properties_provided */
>> +  0, /* properties_destroyed */
>> +  0, /* todo_flags_start */
>> +  TODO_df_finish, /* todo_flags_finish */
>> +};
>> +
>> +class pass_analyze_vecload : public rtl_opt_pass
>> +{
>> +public:
>> +  pass_analyze_vecload(gcc::context *ctxt)
>> +    : rtl_opt_pass(pass_data_analyze_vecload, ctxt)
>> +  {}
>> +
>> +  /* opt_pass methods: */
>> +  virtual bool gate (function *)
>> +    {
>> +      return (optimize > 0 && TARGET_VSX);
>> +    }
>> +
> 
> It should be gated with Power10 as well.
> 
>> +  virtual unsigned int execute (function *fun)
>> +    {
>> +      return rs6000_analyze_vecload (fun);
> 
> As the comment on "return true" in rs6000_analyze_vecload, the return value
> of this function is for TODO flags.
> 
> "...
> The return value contains TODOs to execute in addition to those in
> TODO_flags_finish.   */"
> 
>> +    }
>> +
>> +  opt_pass *clone ()
>> +    {
>> +      return new pass_analyze_vecload (m_ctxt);
>> +    }
> 
> I guess this was copied from rs6000 swap pass, but
> the clone here isn't needed actually since it's only
> one instance.
> 

I will make the changes.

>> +
>> +}; // class pass_analyze_vecload
>> +
>> +rtl_opt_pass *
>> +make_pass_analyze_vecload (gcc::context *ctxt)
>> +{
>> +  return new pass_analyze_vecload (ctxt);
>> +}
>> +
>> diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
>> index cc9253bb040..dba545271e0 100644
>> --- a/gcc/config/rs6000/rs6000.cc
>> +++ b/gcc/config/rs6000/rs6000.cc
>> @@ -387,7 +387,7 @@ mode_supports_vmx_dform (machine_mode mode)
>>  /* Return true if we have D-form addressing in VSX registers.  This addressing
>>     is more limited than normal d-form addressing in that the offset must be
>>     aligned on a 16-byte boundary.  */
>> -static inline bool
>> +bool
>>  mode_supports_dq_form (machine_mode mode)
>>  {
>>    return ((reg_addr[mode].addr_mask[RELOAD_REG_ANY] & RELOAD_REG_QUAD_OFFSET)
>> @@ -1178,6 +1178,7 @@ static bool rs6000_secondary_reload_move (enum rs6000_reg_type,
>>  					  secondary_reload_info *,
>>  					  bool);
>>  rtl_opt_pass *make_pass_analyze_swaps (gcc::context*);
>> +rtl_opt_pass *make_pass_analyze_vecload (gcc::context*);
>>  
>>  /* Hash table stuff for keeping track of TOC entries.  */
>>  
>> diff --git a/gcc/config/rs6000/t-rs6000 b/gcc/config/rs6000/t-rs6000
>> index f183b42ce1d..da7ae26e88b 100644
>> --- a/gcc/config/rs6000/t-rs6000
>> +++ b/gcc/config/rs6000/t-rs6000
>> @@ -47,6 +47,10 @@ rs6000-builtin.o: $(srcdir)/config/rs6000/rs6000-builtin.cc
>>  	$(COMPILE) $<
>>  	$(POSTCOMPILE)
>>  
>> +rs6000-vecload-opt.o: $(srcdir)/config/rs6000/rs6000-vecload-opt.cc
>> +	$(COMPILE) $<
>> +	$(POSTCOMPILE)
>> +
>>  build/rs6000-gen-builtins.o: $(srcdir)/config/rs6000/rs6000-gen-builtins.cc
>>  build/rbtree.o: $(srcdir)/config/rs6000/rbtree.cc
>>  
>> diff --git a/gcc/testsuite/g++.target/powerpc/vecload.C b/gcc/testsuite/g++.target/powerpc/vecload.C
>> new file mode 100644
>> index 00000000000..f1689ad6522
>> --- /dev/null
>> +++ b/gcc/testsuite/g++.target/powerpc/vecload.C
>> @@ -0,0 +1,15 @@
>> +/* { dg-do compile } */ 
>> +/* { dg-require-effective-target powerpc_p9vector_ok } */
>> +/* { dg-options "-mdejagnu-cpu=power10 -O2 -mmma" } */ 
>> +
>> +#include <altivec.h>
>> +
>> +void
>> +foo (__vector_quad *dst, vector unsigned char *ptr, vector unsigned char src)
>> +{
>> +  __vector_quad acc;
>> +  __builtin_mma_xvf32ger(&acc, src, ptr[0]);
>> +  __builtin_mma_xvf32gerpp(&acc, src, ptr[1]);
>> +  *dst = acc;
>> +}
>> +/* { dg-final { scan-assembler {\mlxvp\M} } } */
> 
> Maybe some more test cases as below can be considered and make this pass better:
> 
> void
> foo1 (__vector_quad *dst, vector unsigned char *ptr, vector unsigned char src)
> {
>   __vector_quad acc;
>   __builtin_mma_xvf32ger(&acc, src, ptr[1]);
>   __builtin_mma_xvf32gerpp(&acc, src, ptr[2]);
>   *dst = acc;
> }
> 
> void
> foo2 (__vector_quad *dst1, __vector_quad *dst2, vector unsigned char *ptr, vector unsigned char src)
> {
>   __vector_quad acc;
>   __builtin_mma_xvf32ger(&acc, src, ptr[0]);
>   __builtin_mma_xvf32gerpp(&acc, src, ptr[1]);
>   *dst1 = acc;
>   __builtin_mma_xvf32ger(&acc, src, ptr[2]);
>   __builtin_mma_xvf32gerpp(&acc, src, ptr[3]);
>   *dst2 = acc;
> }
>

I will add the above tests.

Thanks & Regards
Ajit
 
> BR,
> Kewen