From: Richard Sandiford
To: Hao Liu OS
Cc: Richard Biener, GCC-patches@gcc.gnu.org
Subject: Re: [PATCH] AArch64: Do not increase the vect reduction latency by multiplying count [PR110625]
Date: Mon, 31 Jul 2023 10:11:50 +0100

Hao Liu OS writes:
>> Which test case do you see this for?  The two tests in the patch still
>> seem to report correct latencies for me if I make the change above.
>
> Not the newly added tests.  It is still the existing case that caused
> the previous ICE (i.e. the assertion failure):
> gcc.target/aarch64/sve/cost_model_13.c.
>
> It's not that the test case itself fails; rather, the vect dump reports
> a "reduction latency" of 0:
>
> Before the change:
> cost_model_13.c:7:21: note:  Original vector body cost = 6
> cost_model_13.c:7:21: note:  Scalar issue estimate:
> cost_model_13.c:7:21: note:    load operations = 1
> cost_model_13.c:7:21: note:    store operations = 0
> cost_model_13.c:7:21: note:    general operations = 1
> cost_model_13.c:7:21: note:    reduction latency = 1
> cost_model_13.c:7:21: note:    estimated min cycles per iteration = 1.000000
> cost_model_13.c:7:21: note:    estimated cycles per vector iteration (for VF 8) = 8.000000
> cost_model_13.c:7:21: note:  Vector issue estimate:
> cost_model_13.c:7:21: note:    load operations = 1
> cost_model_13.c:7:21: note:    store operations = 0
> cost_model_13.c:7:21: note:    general operations = 1
> cost_model_13.c:7:21: note:    reduction latency = 2
> cost_model_13.c:7:21: note:    estimated min cycles per iteration = 2.000000
>
> After the change:
> cost_model_13.c:7:21: note:  Original vector body cost = 6
> cost_model_13.c:7:21: note:  Scalar issue estimate:
> cost_model_13.c:7:21: note:    load operations = 1
> cost_model_13.c:7:21: note:    store operations = 0
> cost_model_13.c:7:21: note:    general operations = 1
> cost_model_13.c:7:21: note:    reduction latency = 0  <--- seems inconsistent with the result above
> cost_model_13.c:7:21: note:    estimated min cycles per iteration = 1.000000
> cost_model_13.c:7:21: note:    estimated cycles per vector iteration (for VF 8) = 8.000000
> cost_model_13.c:7:21: note:  Vector issue estimate:
> cost_model_13.c:7:21: note:    load operations = 1
> cost_model_13.c:7:21: note:    store operations = 0
> cost_model_13.c:7:21: note:    general operations = 1
> cost_model_13.c:7:21: note:    reduction latency = 0  <--- seems inconsistent with the result above
> cost_model_13.c:7:21: note:    estimated min cycles per iteration = 1.000000  <--- seems inconsistent with the result above
>
> BTW, this should be caused by the reduction stmt not being live, i.e.
> not being part of a computation whose result is used outside the loop
> (tree-vectorizer.h:1204):
>
>   :
>   # res_18 = PHI 
>   # i_20 = PHI 
>   _1 = (long unsigned int) i_20;
>   _2 = _1 * 2;
>   _3 = x_14(D) + _2;
>   _4 = *_3;
>   _5 = (unsigned short) _4;
>   res.0_6 = (unsigned short) res_18;
>   _7 = _5 + res.0_6;   <-- This is not live, probably because of the type cast stmt below.
>   res_15 = (short int) _7;
>   i_16 = i_20 + 1;
>   if (n_11(D) > i_16)
>     goto ;
>   else
>     goto ;
>
>   :
>   goto ;

Ah, I see, thanks.  My concern was: if requiring STMT_VINFO_LIVE_P can
cause "normal" reductions to have a latency of 0, could the same thing
happen for single-cycle reductions?  But I suppose the answer is "no".
Introducing a cast like the above would cause reduc_chain_length > 1,
and so:

  if (ncopies > 1
      && (STMT_VINFO_RELEVANT (stmt_info) <= vect_used_only_live)
      && reduc_chain_length == 1
      && loop_vinfo->suggested_unroll_factor == 1)
    single_defuse_cycle = true;

wouldn't trigger.  Which makes the single-cycle thing a bit
hit-and-miss...

So yeah, I agree the patch is safe after all.
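To make that concrete: the GIMPLE above corresponds to a loop of
roughly this shape (reconstructed from the dump, so it may not match
the actual cost_model_13.c source exactly):

  short
  f (short *x, int n)
  {
    short res = 0;
    for (int i = 0; i < n; ++i)
      /* The += is performed in a promoted type and then narrowed back
         to short, which is where the extra cast comes from.  */
      res += x[i];
    return res;
  }

Because the addition (_7) happens in the promoted type while the
accumulator PHI is short, the live stmt is the separate narrowing cast
(res_15) rather than the add itself, and the same cast is what makes
reduc_chain_length > 1 in the condition quoted above.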
Please split the check out into a helper though, to avoid the awkward
formatting:

/* Return true if STMT_INFO is part of a reduction that has the form:

      r = r op ...;
      r = r op ...;

   with the single accumulator being read and written multiple times.  */
static bool
aarch64_force_single_cycle (vec_info *vinfo, stmt_vec_info stmt_info)
{
  if (!STMT_VINFO_LIVE_P (stmt_info))
    return false;

  auto reduc_info = info_for_reduction (vinfo, stmt_info);
  return STMT_VINFO_FORCE_SINGLE_CYCLE (reduc_info);
}

OK with that change, thanks.

Richard
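(As an illustration of where the helper ends up being used: the check
it replaces sits in the reduction-latency code of
aarch64_vector_costs::count_ops.  The sketch below is paraphrased from
this thread rather than copied from aarch64.cc, so treat the
surrounding details, in particular base, count, ops and the
aarch64_in_loop_reduction_latency call, as assumptions:

  if (stmt_info && vect_is_reduction (stmt_info))
    {
      /* Per-copy latency of the in-loop reduction operation.  */
      unsigned int base
        = aarch64_in_loop_reduction_latency (m_vinfo, stmt_info, m_vec_flags);
      if (aarch64_force_single_cycle (m_vinfo, stmt_info))
        /* All vector copies feed the single accumulator, so the
           loop-carried dependency scales with the number of copies.  */
        ops->reduction_latency = MAX (ops->reduction_latency, base * count);
      else
        ops->reduction_latency = MAX (ops->reduction_latency, base);
    }
)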