From: Richard Sandiford
To: Hao Liu OS
Cc: Richard Biener, GCC-patches@gcc.gnu.org
Subject: Re: [PATCH] AArch64: Do not increase the vect reduction latency by multiplying count [PR110625]
Date: Mon, 31 Jul 2023 10:11:50 +0100

Hao Liu OS writes:
>> Which test case do you see this for?  The two tests in the patch still
>> seem to report correct latencies for me if I make the change above.
>
> Not the newly added tests.  It is still the existing case that caused
> the previous ICE (i.e. the assertion failure):
> gcc.target/aarch64/sve/cost_model_13.c.
>
> It's not that the test case itself fails; rather, the vect dump reports
> a "reduction latency" of 0:
>
> Before the change:
> cost_model_13.c:7:21: note:  Original vector body cost = 6
> cost_model_13.c:7:21: note:  Scalar issue estimate:
> cost_model_13.c:7:21: note:    load operations = 1
> cost_model_13.c:7:21: note:    store operations = 0
> cost_model_13.c:7:21: note:    general operations = 1
> cost_model_13.c:7:21: note:    reduction latency = 1
> cost_model_13.c:7:21: note:    estimated min cycles per iteration = 1.000000
> cost_model_13.c:7:21: note:    estimated cycles per vector iteration (for VF 8) = 8.000000
> cost_model_13.c:7:21: note:  Vector issue estimate:
> cost_model_13.c:7:21: note:    load operations = 1
> cost_model_13.c:7:21: note:    store operations = 0
> cost_model_13.c:7:21: note:    general operations = 1
> cost_model_13.c:7:21: note:    reduction latency = 2
> cost_model_13.c:7:21: note:    estimated min cycles per iteration = 2.000000
>
> After the change:
> cost_model_13.c:7:21: note:  Original vector body cost = 6
> cost_model_13.c:7:21: note:  Scalar issue estimate:
> cost_model_13.c:7:21: note:    load operations = 1
> cost_model_13.c:7:21: note:    store operations = 0
> cost_model_13.c:7:21: note:    general operations = 1
> cost_model_13.c:7:21: note:    reduction latency = 0  <--- seems inconsistent with the result above
> cost_model_13.c:7:21: note:    estimated min cycles per iteration = 1.000000
> cost_model_13.c:7:21: note:    estimated cycles per vector iteration (for VF 8) = 8.000000
> cost_model_13.c:7:21: note:  Vector issue estimate:
> cost_model_13.c:7:21: note:    load operations = 1
> cost_model_13.c:7:21: note:    store operations = 0
> cost_model_13.c:7:21: note:    general operations = 1
> cost_model_13.c:7:21: note:    reduction latency = 0  <--- seems inconsistent with the result above
> cost_model_13.c:7:21: note:    estimated min cycles per iteration = 1.000000  <--- seems inconsistent with the result above
>
> BTW, this should be caused by the reduction stmt not being live, i.e.
> not being part of a computation whose result is used outside the loop
> (tree-vectorizer.h:1204):
>
>   :
>   # res_18 = PHI 
>   # i_20 = PHI 
>   _1 = (long unsigned int) i_20;
>   _2 = _1 * 2;
>   _3 = x_14(D) + _2;
>   _4 = *_3;
>   _5 = (unsigned short) _4;
>   res.0_6 = (unsigned short) res_18;
>   _7 = _5 + res.0_6;   <-- This is not live, probably because of the type cast stmt below.
>   res_15 = (short int) _7;
>   i_16 = i_20 + 1;
>   if (n_11(D) > i_16)
>     goto ;
>   else
>     goto ;
>
>   :
>   goto ;

Ah, I see, thanks.  My concern was: if requiring STMT_VINFO_LIVE_P can
cause "normal" reductions to have a latency of 0, could the same thing
happen for single-cycle reductions?  But I suppose the answer is "no".
Introducing a cast like the above would cause reduc_chain_length > 1,
and so:

  if (ncopies > 1
      && (STMT_VINFO_RELEVANT (stmt_info) <= vect_used_only_live)
      && reduc_chain_length == 1
      && loop_vinfo->suggested_unroll_factor == 1)
    single_defuse_cycle = true;

wouldn't trigger.  Which makes the single-cycle thing a bit
hit-and-miss...

So yeah, I agree the patch is safe after all.
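To make that concrete: the GIMPLE above corresponds to a loop of
roughly this shape (reconstructed from the dump, so it may not match
the actual cost_model_13.c source exactly):

  short
  f (short *x, int n)
  {
    short res = 0;
    for (int i = 0; i < n; ++i)
      /* The += is performed in a promoted type and then narrowed back
         to short, which is where the extra cast comes from.  */
      res += x[i];
    return res;
  }

Because the addition (_7) happens in the promoted type while the
accumulator PHI is short, the live stmt is the separate narrowing cast
(res_15) rather than the add itself, and the same cast is what makes
reduc_chain_length > 1 in the condition quoted above.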
Please split the check out into a helper though, to avoid the awkward
formatting:

/* Return true if STMT_INFO is part of a reduction that has the form:

      r = r op ...;
      r = r op ...;

   with the single accumulator being read and written multiple times.  */
static bool
aarch64_force_single_cycle (vec_info *vinfo, stmt_vec_info stmt_info)
{
  if (!STMT_VINFO_LIVE_P (stmt_info))
    return false;

  auto reduc_info = info_for_reduction (vinfo, stmt_info);
  return STMT_VINFO_FORCE_SINGLE_CYCLE (reduc_info);
}

OK with that change, thanks.

Richard
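(As an illustration of where the helper ends up being used: the check
it replaces sits in the reduction-latency code of
aarch64_vector_costs::count_ops.  The sketch below is paraphrased from
this thread rather than copied from aarch64.cc, so treat the
surrounding details, in particular base, count, ops and the
aarch64_in_loop_reduction_latency call, as assumptions:

  if (stmt_info && vect_is_reduction (stmt_info))
    {
      /* Per-copy latency of the in-loop reduction operation.  */
      unsigned int base
        = aarch64_in_loop_reduction_latency (m_vinfo, stmt_info, m_vec_flags);
      if (aarch64_force_single_cycle (m_vinfo, stmt_info))
        /* All vector copies feed the single accumulator, so the
           loop-carried dependency scales with the number of copies.  */
        ops->reduction_latency = MAX (ops->reduction_latency, base * count);
      else
        ops->reduction_latency = MAX (ops->reduction_latency, base);
    }
)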