From: Richard Sandiford
To: Hao Liu OS
Cc: "GCC-patches@gcc.gnu.org"
Subject: Re: [PATCH] AArch64: Do not increase the vect reduction latency by multiplying count [PR110625]
Date: Mon, 24 Jul 2023 12:10:52 +0100

Hao Liu OS writes:
> This only affects the new costs in the aarch64 backend.  Currently, the
> reduction latency of the vector body is too large, as it is multiplied
> by the stmt count.  As the scalar reduction latency is small, the new
> cost model may decide that "scalar code would issue more quickly" and
> increase the vector body cost a lot, causing missed vectorization
> opportunities.
>
> Tested by bootstrapping on aarch64-linux-gnu.
>
> gcc/ChangeLog:
>
>         PR target/110625
>         * config/aarch64/aarch64.cc (count_ops): Remove the '* count'
>         for reduction_latency.
>
> gcc/testsuite/ChangeLog:
>
>         * gcc.target/aarch64/pr110625.c: New testcase.
> ---
>  gcc/config/aarch64/aarch64.cc               |  5 +--
>  gcc/testsuite/gcc.target/aarch64/pr110625.c | 46 +++++++++++++++++++++
>  2 files changed, 47 insertions(+), 4 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/pr110625.c
>
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 560e5431636..27afa64b7d5 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -16788,10 +16788,7 @@ aarch64_vector_costs::count_ops (unsigned int count, vect_cost_for_stmt kind,
>      {
>        unsigned int base
>          = aarch64_in_loop_reduction_latency (m_vinfo, stmt_info, m_vec_flags);
> -
> -      /* ??? Ideally we'd do COUNT reductions in parallel, but unfortunately
> -         that's not yet the case.  */
> -      ops->reduction_latency = MAX (ops->reduction_latency, base * count);
> +      ops->reduction_latency = MAX (ops->reduction_latency, base);

The ??? is referring to the single_defuse_cycle code in
vectorizable_reduction.  E.g. consider:

long
f (long res, short *ptr1, short *ptr2, int n)
{
  for (int i = 0; i < n; ++i)
    res += (long) ptr1[i] << ptr2[i];
  return res;
}

compiled at -O3.  The main loop is:

        movi    v25.4s, 0               // init accumulator
        lsl     x5, x5, 4
        .p2align 3,,7
.L4:
        ldr     q31, [x1, x4]
        ldr     q29, [x2, x4]
        add     x4, x4, 16
        sxtl    v30.4s, v31.4h
        sxtl2   v31.4s, v31.8h
        sxtl    v28.4s, v29.4h
        sxtl2   v29.4s, v29.8h
        sxtl    v27.2d, v30.2s
        sxtl2   v30.2d, v30.4s
        sxtl    v23.2d, v28.2s
        sxtl2   v26.2d, v28.4s
        sxtl    v24.2d, v29.2s
        sxtl    v28.2d, v31.2s
        sshl    v27.2d, v27.2d, v23.2d
        sshl    v30.2d, v30.2d, v26.2d
        sxtl2   v31.2d, v31.4s
        sshl    v28.2d, v28.2d, v24.2d
        add     v27.2d, v27.2d, v25.2d  // v25 -> v27
        sxtl2   v29.2d, v29.4s
        add     v30.2d, v30.2d, v27.2d  // v27 -> v30
        sshl    v31.2d, v31.2d, v29.2d
        add     v30.2d, v28.2d, v30.2d  // v30 -> v30
        add     v25.2d, v31.2d, v30.2d  // v30 -> v25
        cmp     x4, x5
        bne     .L4

Here count is 4 and the latency of the loop-carried dependency
(v25 -> v27 -> v30 -> v30 -> v25, as annotated above) is 4 additions
(8 cycles).  So, as a worst case, the current cost is correct.  To
remove the count in all cases, we would need to either (1) avoid
single def-use cycles or (2) reassociate the reduction as a tree
(4->2, 2->1, 1+acc->acc), so that only the final accumulation stays
on the loop-carried path.
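As a purely illustrative scalar analogue of the two shapes (the
function names are invented, and the real distinction is between the
vector statement copies rather than array elements):

long
reduce_chained (const long *p, long n, long acc)
{
  /* Single def-use cycle: each of the four per-iteration results is
     chained through the accumulator, so the loop-carried dependency
     is four additions long.  */
  for (long i = 0; i < n; i += 4)
    acc = (((acc + p[i]) + p[i + 1]) + p[i + 2]) + p[i + 3];
  return acc;
}

long
reduce_tree (const long *p, long n, long acc)
{
  /* Tree reassociation (4->2, 2->1, 1+acc->acc): the four results
     combine independently of the accumulator, leaving a single
     addition on the loop-carried path.  */
  for (long i = 0; i < n; i += 4)
    acc += (p[i] + p[i + 1]) + (p[i + 2] + p[i + 3]);
  return acc;
}

Modulo signed-overflow caveats, both compute the same sum when n is a
multiple of 4; only the per-iteration critical path differs.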
But looking again, it seems we do have the information to distinguish
the cases.  We can do something like:

          stmt_vec_info reduc_info = info_for_reduction (m_vinfo, stmt_info);
          if (STMT_VINFO_FORCE_SINGLE_CYCLE (reduc_info))
            /* ??? Ideally we'd use a tree to reduce the copies down
               to 1 vector, and then accumulate that, but at the moment
               the loop-carried dependency includes all copies.  */
            ops->reduction_latency = MAX (ops->reduction_latency,
                                          base * count);
          else
            ops->reduction_latency = MAX (ops->reduction_latency, base);

(completely untested).

Thanks,
Richard