From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from foss.arm.com (foss.arm.com [217.140.110.172]) by sourceware.org (Postfix) with ESMTP id CC2413858C5E for ; Thu, 3 Aug 2023 10:10:04 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org CC2413858C5E Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=arm.com Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=arm.com Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.121.207.14]) by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id 350C7113E; Thu, 3 Aug 2023 03:10:47 -0700 (PDT) Received: from localhost (e121540-lin.manchester.arm.com [10.32.110.72]) by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id A21983F5A1; Thu, 3 Aug 2023 03:10:03 -0700 (PDT) From: Richard Sandiford To: Hao Liu OS Mail-Followup-To: Hao Liu OS ,Richard Biener , "GCC-patches\@gcc.gnu.org" , richard.sandiford@arm.com Cc: Richard Biener , "GCC-patches\@gcc.gnu.org" Subject: Re: [PATCH] AArch64: Do not increase the vect reduction latency by multiplying count [PR110625] References: Date: Thu, 03 Aug 2023 11:10:02 +0100 In-Reply-To: (Hao Liu's message of "Wed, 2 Aug 2023 03:45:49 +0000") Message-ID: User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/26.3 (gnu/linux) MIME-Version: 1.0 Content-Type: text/plain X-Spam-Status: No, score=-25.8 required=5.0 tests=BAYES_00,GIT_PATCH_0,KAM_ASCII_DIVIDERS,KAM_DMARC_NONE,KAM_DMARC_STATUS,KAM_LAZY_DOMAIN_SECURITY,KAM_SHORT,SPF_HELO_NONE,SPF_NONE,TXREP,T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org List-Id: Hao Liu OS writes: > Hi Richard, > > Update the patch with a simple case (see below case and comments). It shows a live stmt may not have reduction def, which introduce the ICE. > > Is it OK for trunk? OK, thanks. Richard > ---- > Fix the assertion failure on empty reduction define in info_for_reduction. > Even a stmt is live, it may still have empty reduction define. Check the > reduction definition instead of live info before calling info_for_reduction. > > gcc/ChangeLog: > > PR target/110625 > * config/aarch64/aarch64.cc (aarch64_force_single_cycle): check > STMT_VINFO_REDUC_DEF to avoid failures in info_for_reduction. > > gcc/testsuite/ChangeLog: > > * gcc.target/aarch64/pr110625_3.c: New testcase. > --- > gcc/config/aarch64/aarch64.cc | 2 +- > gcc/testsuite/gcc.target/aarch64/pr110625_3.c | 34 +++++++++++++++++++ > 2 files changed, 35 insertions(+), 1 deletion(-) > create mode 100644 gcc/testsuite/gcc.target/aarch64/pr110625_3.c > > diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc > index d4d76025545..5b8d8fa8e2d 100644 > --- a/gcc/config/aarch64/aarch64.cc > +++ b/gcc/config/aarch64/aarch64.cc > @@ -16776,7 +16776,7 @@ aarch64_adjust_stmt_cost (vect_cost_for_stmt kind, stmt_vec_info stmt_info, > static bool > aarch64_force_single_cycle (vec_info *vinfo, stmt_vec_info stmt_info) > { > - if (!STMT_VINFO_LIVE_P (stmt_info)) > + if (!STMT_VINFO_REDUC_DEF (stmt_info)) > return false; > > auto reduc_info = info_for_reduction (vinfo, stmt_info); > diff --git a/gcc/testsuite/gcc.target/aarch64/pr110625_3.c b/gcc/testsuite/gcc.target/aarch64/pr110625_3.c > new file mode 100644 > index 00000000000..35a50290cb0 > --- /dev/null > +++ b/gcc/testsuite/gcc.target/aarch64/pr110625_3.c > @@ -0,0 +1,34 @@ > +/* { dg-do compile } */ > +/* { dg-options "-O3 -mcpu=neoverse-n2" } */ > + > +/* Avoid ICE on empty reduction def in single_defuse_cycle. > + > + E.g. > + [local count: 858993456]: > + # sum_18 = PHI > + sum.0_5 = (unsigned int) sum_18; > + _6 = _4 + sum.0_5; <-- it is "live" but doesn't have reduction def > + sum_15 = (int) _6; > + ... > + if (ivtmp_29 != 0) > + goto ; [75.00%] > + else > + goto ; [25.00%] > + > + [local count: 644245086]: > + goto ; [100.00%] > + > + [local count: 214748368]: > + # _31 = PHI <_6(3)> > + _8 = _31 >> 1; > +*/ > + > +int > +f (unsigned int *tmp) > +{ > + int sum = 0; > + for (int i = 0; i < 4; i++) > + sum += tmp[i]; > + > + return (unsigned int) sum >> 1; > +} > -- > 2.34.1 > > ________________________________________ > From: Hao Liu OS > Sent: Tuesday, August 1, 2023 17:43 > To: Richard Sandiford > Cc: Richard Biener; GCC-patches@gcc.gnu.org > Subject: Re: [PATCH] AArch64: Do not increase the vect reduction latency by multiplying count [PR110625] > > Hi Richard, > > This is a quick fix to the several ICEs. It seems even STMT_VINFO_LIVE_P is true, some reduct stmts still don't have REDUC_DEF. So I change the check to STMT_VINFO_REDUC_DEF. > > Is it OK for trunk? > > --- > Fix the ICEs on empty reduction define. Even STMT_VINFO_LIVE_P is true, some reduct stmts > still don't have definition. > > gcc/ChangeLog: > > PR target/110625 > * config/aarch64/aarch64.cc (aarch64_force_single_cycle): check > STMT_VINFO_REDUC_DEF to avoid failures in info_for_reduction > --- > gcc/config/aarch64/aarch64.cc | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc > index d4d76025545..5b8d8fa8e2d 100644 > --- a/gcc/config/aarch64/aarch64.cc > +++ b/gcc/config/aarch64/aarch64.cc > @@ -16776,7 +16776,7 @@ aarch64_adjust_stmt_cost (vect_cost_for_stmt kind, stmt_vec_info stmt_info, > static bool > aarch64_force_single_cycle (vec_info *vinfo, stmt_vec_info stmt_info) > { > - if (!STMT_VINFO_LIVE_P (stmt_info)) > + if (!STMT_VINFO_REDUC_DEF (stmt_info)) > return false; > > auto reduc_info = info_for_reduction (vinfo, stmt_info); > -- > 2.40.0 > > > ________________________________________ > From: Richard Sandiford > Sent: Monday, July 31, 2023 17:11 > To: Hao Liu OS > Cc: Richard Biener; GCC-patches@gcc.gnu.org > Subject: Re: [PATCH] AArch64: Do not increase the vect reduction latency by multiplying count [PR110625] > > Hao Liu OS writes: >>> Which test case do you see this for? The two tests in the patch still >>> seem to report correct latencies for me if I make the change above. >> >> Not the newly added tests. It is still the existing case causing the previous ICE (i.e. assertion problem): gcc.target/aarch64/sve/cost_model_13.c. >> >> It's not the test case itself failed, but the dump message of vect says the "reduction latency" is 0: >> >> Before the change: >> cost_model_13.c:7:21: note: Original vector body cost = 6 >> cost_model_13.c:7:21: note: Scalar issue estimate: >> cost_model_13.c:7:21: note: load operations = 1 >> cost_model_13.c:7:21: note: store operations = 0 >> cost_model_13.c:7:21: note: general operations = 1 >> cost_model_13.c:7:21: note: reduction latency = 1 >> cost_model_13.c:7:21: note: estimated min cycles per iteration = 1.000000 >> cost_model_13.c:7:21: note: estimated cycles per vector iteration (for VF 8) = 8.000000 >> cost_model_13.c:7:21: note: Vector issue estimate: >> cost_model_13.c:7:21: note: load operations = 1 >> cost_model_13.c:7:21: note: store operations = 0 >> cost_model_13.c:7:21: note: general operations = 1 >> cost_model_13.c:7:21: note: reduction latency = 2 >> cost_model_13.c:7:21: note: estimated min cycles per iteration = 2.000000 >> >> After the change: >> cost_model_13.c:7:21: note: Original vector body cost = 6 >> cost_model_13.c:7:21: note: Scalar issue estimate: >> cost_model_13.c:7:21: note: load operations = 1 >> cost_model_13.c:7:21: note: store operations = 0 >> cost_model_13.c:7:21: note: general operations = 1 >> cost_model_13.c:7:21: note: reduction latency = 0 <--- seems not consistent with above result >> cost_model_13.c:7:21: note: estimated min cycles per iteration = 1.000000 >> cost_model_13.c:7:21: note: estimated cycles per vector iteration (for VF 8) = 8.000000 >> cost_model_13.c:7:21: note: Vector issue estimate: >> cost_model_13.c:7:21: note: load operations = 1 >> cost_model_13.c:7:21: note: store operations = 0 >> cost_model_13.c:7:21: note: general operations = 1 >> cost_model_13.c:7:21: note: reduction latency = 0 <--- seems not consistent with above result >> cost_model_13.c:7:21: note: estimated min cycles per iteration = 1.000000 <--- seems not consistent with above result >> >> BTW. this should be caused by the reduction stmt is not live, which indicates whether this stmts is part of a computation whose result is used outside the loop (tree-vectorized.h:1204): >> : >> # res_18 = PHI >> # i_20 = PHI >> _1 = (long unsigned int) i_20; >> _2 = _1 * 2; >> _3 = x_14(D) + _2; >> _4 = *_3; >> _5 = (unsigned short) _4; >> res.0_6 = (unsigned short) res_18; >> _7 = _5 + res.0_6; <-- This is not live, may be caused by the below type cast stmt. >> res_15 = (short int) _7; >> i_16 = i_20 + 1; >> if (n_11(D) > i_16) >> goto ; >> else >> goto ; >> >> : >> goto ; > > Ah, I see, thanks. My concern was: if requiring !STMT_VINFO_LIVE_P stmts > can cause "normal" reductions to have a latency of 0, could the same thing > happen for single-cycle reductions? But I suppose the answer is "no". > Introducing a cast like the above would cause reduc_chain_length > 1, > and so: > > if (ncopies > 1 > && (STMT_VINFO_RELEVANT (stmt_info) <= vect_used_only_live) > && reduc_chain_length == 1 > && loop_vinfo->suggested_unroll_factor == 1) > single_defuse_cycle = true; > > wouldn't trigger. Which makes the single-cycle thing a bit hit-and-miss... > > So yeah, I agree the patch is safe after all. > > Please split the check out into a helper though, to avoid the awkward > formatting: > > /* Return true if STMT_INFO is part of a reduction that has the form: > > r = r op ...; > r = r op ...; > > with the single accumulator being read and written multiple times. */ > static bool > aarch64_force_single_cycle (vec_info *vinfo, stmt_vec_info stmt_info) > { > if (!STMT_VINFO_LIVE_P (stmt_info)) > return false; > > auto reduc_info = info_for_reduction (vinfo, stmt_info); > return STMT_VINFO_FORCE_SINGLE_CYCLE (reduc_info); > } > > OK with that change, thanks. > > Richard