Date: Mon, 2 Aug 2021 15:40:35 +0200 (CEST)
From: Richard Biener
To: gcc-patches@gcc.gnu.org
Cc: richard.sandiford@arm.com, hongtao.liu@intel.com
Subject: [PATCH 2/2] Rewrite more vector loads to scalar loads

This teaches forwprop to rewrite more vector loads that are only used
in BIT_FIELD_REFs as scalar loads.  This provides the remaining uplift
to SPEC CPU 2017 510.parest_r on Zen 2, which has CPU gathers disabled.

In particular vector load + vec_unpack + bit-field-ref is turned into
(extending) scalar loads, which avoids costly XMM/GPR transitions.  To
avoid conflicting with the matching of vector load + bit-field-ref +
vector constructor to vector load + shuffle, the extended transform is
only done after vector lowering.

Overall the two patches provide a 22% speedup of 510.parest_r.
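
For reference, the kind of emulated-gather loop this targets looks
roughly like the following (a reduced sketch in the spirit of the
adjusted testcase below, not the testcase itself; the function and
parameter names are made up):

  double
  gather_sum (double *data, int *idx, int n)
  {
    double sum = 0.;
    /* With emulated gathers the vectorizer loads a vector of idx[]
       elements, possibly widens it with VEC_UNPACK_{LO,HI}_EXPR and
       extracts the lanes with BIT_FIELD_REFs to form the element
       loads from data[].  */
    for (int i = 0; i < n; ++i)
      sum += data[idx[i]];
    return sum;
  }

With this patch the index vector load and the unpacking are replaced
by (extending) scalar loads of the individual indices, avoiding the
XMM/GPR transitions mentioned above.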

I'm in the process of confirming speedups of 500.perlbench_r, 557.xz_r,
549.fotonik3d_r and 554.roms_r, as well as slowdowns of 503.bwaves_r,
507.cactuBSSN_r and 538.imagick_r.

2021-07-30  Richard Biener

        * tree-ssa-forwprop.c (pass_forwprop::execute): Split out code
        to decompose vector loads ...
        (optimize_vector_load): ... here.  Generalize it to handle
        intermediate widening and TARGET_MEM_REF loads and apply it to
        loads with a supported vector mode as well.

        * gcc.target/i386/vect-gather-1.c: Amend.
---
 gcc/testsuite/gcc.target/i386/vect-gather-1.c |   4 +-
 gcc/tree-ssa-forwprop.c                       | 244 +++++++++++++-----
 2 files changed, 185 insertions(+), 63 deletions(-)

diff --git a/gcc/testsuite/gcc.target/i386/vect-gather-1.c b/gcc/testsuite/gcc.target/i386/vect-gather-1.c
index 134aef39666..261b66be061 100644
--- a/gcc/testsuite/gcc.target/i386/vect-gather-1.c
+++ b/gcc/testsuite/gcc.target/i386/vect-gather-1.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-Ofast -msse2 -fdump-tree-vect-details" } */
+/* { dg-options "-Ofast -msse2 -fdump-tree-vect-details -fdump-tree-forwprop4" } */
 
 #ifndef INDEXTYPE
 #define INDEXTYPE int
@@ -16,3 +16,5 @@ double vmul(INDEXTYPE *rowstart, INDEXTYPE *rowend,
 /* With gather emulation this should be profitable to vectorize
    even with plain SSE2.  */
 /* { dg-final { scan-tree-dump "loop vectorized" "vect" } } */
+/* The index vector loads and promotions should be scalar after forwprop.  */
+/* { dg-final { scan-tree-dump-not "vec_unpack" "forwprop4" } } */
diff --git a/gcc/tree-ssa-forwprop.c b/gcc/tree-ssa-forwprop.c
index db3b18b275c..bd64b8e46bc 100644
--- a/gcc/tree-ssa-forwprop.c
+++ b/gcc/tree-ssa-forwprop.c
@@ -2757,6 +2757,182 @@ simplify_vector_constructor (gimple_stmt_iterator *gsi)
 }
 
+/* Rewrite the vector load at *GSI to component-wise loads if the load
+   is only used in BIT_FIELD_REF extractions with eventual intermediate
+   widening.  */
+
+static void
+optimize_vector_load (gimple_stmt_iterator *gsi)
+{
+  gimple *stmt = gsi_stmt (*gsi);
+  tree lhs = gimple_assign_lhs (stmt);
+  tree rhs = gimple_assign_rhs1 (stmt);
+
+  /* Gather BIT_FIELD_REFs to rewrite, looking through
+     VEC_UNPACK_{LO,HI}_EXPR.  */
+  use_operand_p use_p;
+  imm_use_iterator iter;
+  bool rewrite = true;
+  auto_vec<gimple *, 8> bf_stmts;
+  auto_vec<tree, 8> worklist;
+  worklist.quick_push (lhs);
+  do
+    {
+      tree def = worklist.pop ();
+      unsigned HOST_WIDE_INT def_eltsize
+        = TREE_INT_CST_LOW (TYPE_SIZE (TREE_TYPE (TREE_TYPE (def))));
+      FOR_EACH_IMM_USE_FAST (use_p, iter, def)
+        {
+          gimple *use_stmt = USE_STMT (use_p);
+          if (is_gimple_debug (use_stmt))
+            continue;
+          if (!is_gimple_assign (use_stmt))
+            {
+              rewrite = false;
+              break;
+            }
+          enum tree_code use_code = gimple_assign_rhs_code (use_stmt);
+          tree use_rhs = gimple_assign_rhs1 (use_stmt);
+          if (use_code == BIT_FIELD_REF
+              && TREE_OPERAND (use_rhs, 0) == def
+              /* If its on the VEC_UNPACK_{HI,LO}_EXPR
+                 def need to verify it is element aligned.  */
+              && (def == lhs
+                  || (known_eq (bit_field_size (use_rhs), def_eltsize)
+                      && constant_multiple_p (bit_field_offset (use_rhs),
+                                              def_eltsize))))
+            {
+              bf_stmts.safe_push (use_stmt);
+              continue;
+            }
+          /* Walk through one level of VEC_UNPACK_{LO,HI}_EXPR.  */
+          if (def == lhs
+              && (use_code == VEC_UNPACK_HI_EXPR
+                  || use_code == VEC_UNPACK_LO_EXPR)
+              && use_rhs == lhs)
+            {
+              worklist.safe_push (gimple_assign_lhs (use_stmt));
+              continue;
+            }
+          rewrite = false;
+          break;
+        }
+      if (!rewrite)
+        break;
+    }
+  while (!worklist.is_empty ());
+
+  if (!rewrite)
+    {
+      gsi_next (gsi);
+      return;
+    }
+  /* We now have all ultimate uses of the load to rewrite in bf_stmts.  */
+
+  /* Prepare the original ref to be wrapped in adjusted BIT_FIELD_REFs.
+     For TARGET_MEM_REFs we have to separate the LEA from the reference.  */
+  tree load_rhs = rhs;
+  if (TREE_CODE (load_rhs) == TARGET_MEM_REF)
+    {
+      if (TREE_CODE (TREE_OPERAND (load_rhs, 0)) == ADDR_EXPR)
+        mark_addressable (TREE_OPERAND (TREE_OPERAND (load_rhs, 0), 0));
+      tree tem = make_ssa_name (TREE_TYPE (TREE_OPERAND (load_rhs, 0)));
+      gimple *new_stmt
+        = gimple_build_assign (tem, build1 (ADDR_EXPR, TREE_TYPE (tem),
+                                            unshare_expr (load_rhs)));
+      gsi_insert_before (gsi, new_stmt, GSI_SAME_STMT);
+      load_rhs = build2_loc (EXPR_LOCATION (load_rhs),
+                             MEM_REF, TREE_TYPE (load_rhs), tem,
+                             build_int_cst
+                               (TREE_TYPE (TREE_OPERAND (load_rhs, 1)), 0));
+    }
+
+  /* Rewrite the BIT_FIELD_REFs to be actual loads, re-emitting them at
+     the place of the original load.  */
+  for (gimple *use_stmt : bf_stmts)
+    {
+      tree bfr = gimple_assign_rhs1 (use_stmt);
+      tree new_rhs = unshare_expr (load_rhs);
+      if (TREE_OPERAND (bfr, 0) != lhs)
+        {
+          /* When the BIT_FIELD_REF is on the promoted vector we have to
+             adjust it and emit a conversion afterwards.  */
+          gimple *def_stmt
+              = SSA_NAME_DEF_STMT (TREE_OPERAND (bfr, 0));
+          enum tree_code def_code
+              = gimple_assign_rhs_code (def_stmt);
+
+          /* The adjusted BIT_FIELD_REF is of the promotion source
+             vector size and at half of the offset...  */
+          new_rhs = fold_build3 (BIT_FIELD_REF,
+                                 TREE_TYPE (TREE_TYPE (lhs)),
+                                 new_rhs,
+                                 TYPE_SIZE (TREE_TYPE (TREE_TYPE (lhs))),
+                                 size_binop (EXACT_DIV_EXPR,
+                                             TREE_OPERAND (bfr, 2),
+                                             bitsize_int (2)));
+          /* ... and offsetted by half of the vector if VEC_UNPACK_HI_EXPR.  */
+          if (def_code == (!BYTES_BIG_ENDIAN
+                           ? VEC_UNPACK_HI_EXPR : VEC_UNPACK_LO_EXPR))
+            TREE_OPERAND (new_rhs, 2)
+              = size_binop (PLUS_EXPR, TREE_OPERAND (new_rhs, 2),
+                            size_binop (EXACT_DIV_EXPR,
+                                        TYPE_SIZE (TREE_TYPE (lhs)),
+                                        bitsize_int (2)));
+          tree tem = make_ssa_name (TREE_TYPE (TREE_TYPE (lhs)));
+          gimple *new_stmt = gimple_build_assign (tem, new_rhs);
+          location_t loc = gimple_location (use_stmt);
+          gimple_set_location (new_stmt, loc);
+          gsi_insert_before (gsi, new_stmt, GSI_SAME_STMT);
+          /* Perform scalar promotion.  */
+          new_stmt = gimple_build_assign (gimple_assign_lhs (use_stmt),
+                                          NOP_EXPR, tem);
+          gimple_set_location (new_stmt, loc);
+          gsi_insert_before (gsi, new_stmt, GSI_SAME_STMT);
+        }
+      else
+        {
+          /* When the BIT_FIELD_REF is on the original load result
+             we can just wrap that.  */
+          tree new_rhs = fold_build3 (BIT_FIELD_REF, TREE_TYPE (bfr),
+                                      unshare_expr (load_rhs),
+                                      TREE_OPERAND (bfr, 1),
+                                      TREE_OPERAND (bfr, 2));
+          gimple *new_stmt = gimple_build_assign (gimple_assign_lhs (use_stmt),
+                                                  new_rhs);
+          location_t loc = gimple_location (use_stmt);
+          gimple_set_location (new_stmt, loc);
+          gsi_insert_before (gsi, new_stmt, GSI_SAME_STMT);
+        }
+      gimple_stmt_iterator gsi2 = gsi_for_stmt (use_stmt);
+      unlink_stmt_vdef (use_stmt);
+      gsi_remove (&gsi2, true);
+    }
+
+  /* Finally get rid of the intermediate stmts.  */
+  gimple *use_stmt;
+  FOR_EACH_IMM_USE_STMT (use_stmt, iter, lhs)
+    {
+      if (is_gimple_debug (use_stmt))
+        {
+          if (gimple_debug_bind_p (use_stmt))
+            {
+              gimple_debug_bind_reset_value (use_stmt);
+              update_stmt (use_stmt);
+            }
+          continue;
+        }
+      gimple_stmt_iterator gsi2 = gsi_for_stmt (use_stmt);
+      unlink_stmt_vdef (use_stmt);
+      release_defs (use_stmt);
+      gsi_remove (&gsi2, true);
+    }
+  /* And the original load.  */
+  release_defs (stmt);
+  gsi_remove (gsi, true);
+}
+
+
 /* Primitive "lattice" function for gimple_simplify.  */
 
 static tree
@@ -3007,71 +3183,15 @@ pass_forwprop::execute (function *fun)
                 gsi_next (&gsi);
             }
           else if (TREE_CODE (TREE_TYPE (lhs)) == VECTOR_TYPE
-                   && TYPE_MODE (TREE_TYPE (lhs)) == BLKmode
+                   && (TYPE_MODE (TREE_TYPE (lhs)) == BLKmode
+                       /* After vector lowering rewrite all loads, but
+                          initially do not since this conflicts with
+                          vector CONSTRUCTOR to shuffle optimization.  */
+                       || (fun->curr_properties & PROP_gimple_lvec))
                    && gimple_assign_load_p (stmt)
                    && !gimple_has_volatile_ops (stmt)
-                   && (TREE_CODE (gimple_assign_rhs1 (stmt))
-                       != TARGET_MEM_REF)
                    && !stmt_can_throw_internal (cfun, stmt))
-            {
-              /* Rewrite loads used only in BIT_FIELD_REF extractions to
-                 component-wise loads.  */
-              use_operand_p use_p;
-              imm_use_iterator iter;
-              bool rewrite = true;
-              FOR_EACH_IMM_USE_FAST (use_p, iter, lhs)
-                {
-                  gimple *use_stmt = USE_STMT (use_p);
-                  if (is_gimple_debug (use_stmt))
-                    continue;
-                  if (!is_gimple_assign (use_stmt)
-                      || gimple_assign_rhs_code (use_stmt) != BIT_FIELD_REF
-                      || TREE_OPERAND (gimple_assign_rhs1 (use_stmt), 0) != lhs)
-                    {
-                      rewrite = false;
-                      break;
-                    }
-                }
-              if (rewrite)
-                {
-                  gimple *use_stmt;
-                  FOR_EACH_IMM_USE_STMT (use_stmt, iter, lhs)
-                    {
-                      if (is_gimple_debug (use_stmt))
-                        {
-                          if (gimple_debug_bind_p (use_stmt))
-                            {
-                              gimple_debug_bind_reset_value (use_stmt);
-                              update_stmt (use_stmt);
-                            }
-                          continue;
-                        }
-
-                      tree bfr = gimple_assign_rhs1 (use_stmt);
-                      tree new_rhs = fold_build3 (BIT_FIELD_REF,
-                                                  TREE_TYPE (bfr),
-                                                  unshare_expr (rhs),
-                                                  TREE_OPERAND (bfr, 1),
-                                                  TREE_OPERAND (bfr, 2));
-                      gimple *new_stmt
-                        = gimple_build_assign (gimple_assign_lhs (use_stmt),
-                                               new_rhs);
-
-                      location_t loc = gimple_location (use_stmt);
-                      gimple_set_location (new_stmt, loc);
-                      gimple_stmt_iterator gsi2 = gsi_for_stmt (use_stmt);
-                      unlink_stmt_vdef (use_stmt);
-                      gsi_remove (&gsi2, true);
-
-                      gsi_insert_before (&gsi, new_stmt, GSI_SAME_STMT);
-                    }
-
-                  release_defs (stmt);
-                  gsi_remove (&gsi, true);
-                }
-              else
-                gsi_next (&gsi);
-            }
+            optimize_vector_load (&gsi);
           else if (code == COMPLEX_EXPR)
             {
-- 
2.31.1