Date: Thu, 29 Jun 2023 12:28:12 +0000 (UTC)
From: Richard Biener
To: gcc-patches@gcc.gnu.org
cc: richard.sandiford@arm.com
Subject: [PATCH][RFC] target/110456 - avoid loop masking with zero distance dependences
Message-ID: <20230629122812.gpfnAgiFPeHAyZ9xqnRe5oPohHJgBd1exzwXQrF9k0k@z>

When applying loop masking to epilogues on x86_64 AVX512 we see some
significant performance regressions when evaluating SPEC CPU 2017
that are caused by store-to-load forwarding failures across outer
loop iterations when the inner loop does not iterate.  Consider

  for (j = 0; j < m; ++j)
    for (i = 0; i < n; ++i)
      a[j*n + i] += b[j*n + i];

with 'n' chosen so that the vectorized inner loop is executed
entirely by the masked epilogue, and with that masked epilogue
storing more than n elements (with elements >= n masked, of course).
Then the masked load performed for the next outer loop iteration
will get a hit in the store queue but it obviously cannot forward,
so we have to wait for the store to retire.

That causes a significant hit to performance, especially if a
non-masked epilogue would have fully covered 'n' as well (say
n == 4 for a V4DImode epilogue), avoiding the need for
store-forwarding and for waiting on the store to retire.

The following applies a very simple heuristic, disabling
the use of loop masking when there's a memory reference pair
with dependence distance zero.  That resolves the issue
(other problematic dependence distances seem to be less
common, at least).

I have applied this heuristic in generic vectorizer code but
restricted it to non-VL vector sizes.  There currently isn't
a way for the target to request disabling of masking only;
while we can reject the vectorization at costing time, that
will not re-consider the same vector mode without masking.
It seems simply re-costing with masking disabled should be
possible though, we'd just need an indication whether that
should be done?
Maybe always when the current vector mode is of fixed size?
I wonder how SVE vectorized code behaves in these situations?

The affected SPEC CPU 2017 benchmarks were 527.cam4_r and
503.bwaves_r, though I think both will need a hardware vector
size covering at least 8 doubles to show the issue.  527.cam4_r
has 4 elements in the inner loop, 503.bwaves_r 5 IIRC.

Bootstrap / regtest running on x86_64-unknown-linux-gnu.

Any comments?

Thanks,
Richard.

	PR target/110456
	* tree-vectorizer.h (vec_info_shared::has_zero_dep_dist): New.
	* tree-vectorizer.cc (vec_info_shared::vec_info_shared):
	Initialize has_zero_dep_dist.
	* tree-vect-data-refs.cc (vect_analyze_data_ref_dependence):
	Remember if we've seen a dependence distance of zero.
	* tree-vect-stmts.cc (check_load_store_for_partial_vectors):
	When we've seen a dependence distance of zero and the vector
	type has constant size disable the use of partial vectors.
---
 gcc/tree-vect-data-refs.cc |  2 ++
 gcc/tree-vect-stmts.cc     | 10 ++++++++++
 gcc/tree-vectorizer.cc     |  1 +
 gcc/tree-vectorizer.h      |  3 +++
 4 files changed, 16 insertions(+)

diff --git a/gcc/tree-vect-data-refs.cc b/gcc/tree-vect-data-refs.cc
index ebe93832b1e..40cde95c16a 100644
--- a/gcc/tree-vect-data-refs.cc
+++ b/gcc/tree-vect-data-refs.cc
@@ -470,6 +470,8 @@ vect_analyze_data_ref_dependence (struct data_dependence_relation *ddr,
                              "dependence distance == 0 between %T and %T\n",
                              DR_REF (dra), DR_REF (drb));
 
+          loop_vinfo->shared->has_zero_dep_dist = true;
+
           /* When we perform grouped accesses and perform implicit CSE
              by detecting equal accesses and doing disambiguation with
              runtime alias tests like for
diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index d642d3c257f..3bcbc000323 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -1839,6 +1839,16 @@ check_load_store_for_partial_vectors (loop_vec_info loop_vinfo, tree vectype,
       using_partial_vectors_p = true;
     }
 
+  if (loop_vinfo->shared->has_zero_dep_dist
+      && TYPE_VECTOR_SUBPARTS (vectype).is_constant ())
+    {
+      if (dump_enabled_p ())
+        dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+                         "disabling partial vectors because of possible "
+                         "STLF issues\n");
+      LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
+    }
+
   if (!using_partial_vectors_p)
     {
       if (dump_enabled_p ())
diff --git a/gcc/tree-vectorizer.cc b/gcc/tree-vectorizer.cc
index a048e9d8917..74457259b6e 100644
--- a/gcc/tree-vectorizer.cc
+++ b/gcc/tree-vectorizer.cc
@@ -478,6 +478,7 @@ vec_info::~vec_info ()
 
 vec_info_shared::vec_info_shared ()
   : n_stmts (0),
+    has_zero_dep_dist (false),
     datarefs (vNULL),
     datarefs_copy (vNULL),
     ddrs (vNULL)
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index a36974c2c0d..7626cda2a73 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -419,6 +419,9 @@ public:
   /* The number of scalar stmts.  */
   unsigned n_stmts;
 
+  /* Whether there's a dependence with zero distance.  */
+  bool has_zero_dep_dist;
+
   /* All data references.  Freed by free_data_refs, so not an auto_vec.  */
   vec datarefs;
   vec datarefs_copy;
-- 
2.35.3
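
For anyone wanting to poke at this, the loop nest from the description
can be written as a standalone kernel, roughly as below.  This is only
a sketch and not part of the patch: the names add_rows and
masked_off_lanes are made up for illustration, and the element type
(double) and the 8-lane vector are assumptions taken from the benchmark
remark above.  The helper just counts how many lanes of a masked
epilogue store lie beyond 'n'; those are the lanes the next row's
masked load overlaps without being able to forward from.

```c
#include <assert.h>
#include <stddef.h>

/* Loop nest from the description.  The load and the store of
   a[j*n + i] touch the same address in the same iteration, which is
   the "dependence distance == 0" case the heuristic keys on.  */
void
add_rows (double *a, const double *b, size_t m, size_t n)
{
  for (size_t j = 0; j < m; ++j)
    for (size_t i = 0; i < n; ++i)
      a[j * n + i] += b[j * n + i];
}

/* Hypothetical helper, for illustration only: number of lanes that
   are masked off when the final partial vector of an inner loop with
   N iterations is handled by an NLANES-wide masked epilogue.
   Returns 0 when N is a multiple of NLANES (no partial vector).  */
size_t
masked_off_lanes (size_t n, size_t nlanes)
{
  assert (nlanes > 0);
  size_t tail = n % nlanes;
  return tail == 0 ? 0 : nlanes - tail;
}
```

With n == 5 and 8 lanes the masked store spans three lanes past the
end of the row, which is where the next outer iteration starts
loading.  With n == 4 a non-masked V4DImode epilogue would cover the
row exactly, so there would be nothing to mask.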