Date: Thu, 27 Aug 2020 14:24:52 +0200
From: Jakub Jelinek
To: Hongtao Liu
Cc: GCC Patches
Subject: Re: [PATCH] [AVX512] [PR87767] Optimize memory broadcast for constant vector under AVX512
Message-ID: <20200827122452.GN2961@tucnak>

On Thu, Jul 09, 2020 at 04:33:46PM +0800, Hongtao Liu via Gcc-patches wrote:
> +static void
> +replace_constant_pool_with_broadcast (rtx_insn* insn)
> +{
> +  subrtx_ptr_iterator::array_type array;
> +  FOR_EACH_SUBRTX_PTR (iter, array, &PATTERN (insn), ALL)
> +    {
> +      rtx *loc = *iter;
> +      rtx x = *loc;
> +      rtx broadcast_mem, vec_dup, constant, first;
> +      machine_mode mode;
> +      if (GET_CODE (x) != MEM

MEM_P

> +          || GET_CODE (XEXP (x, 0)) != SYMBOL_REF

SYMBOL_REF_P

> +          || !CONSTANT_POOL_ADDRESS_P (XEXP (x, 0)))
> +        continue;
> +
> +      mode = GET_MODE (x);
> +      if (!VECTOR_MODE_P (mode))
> +        return;
> +
> +      constant = get_pool_constant (XEXP (x, 0));
> +      first = XVECEXP (constant, 0, 0);

Shouldn't this verify first that GET_CODE (constant) == CONST_VECTOR
and punt otherwise?

> +      broadcast_mem = force_const_mem (GET_MODE_INNER (mode), first);
> +      vec_dup = gen_rtx_VEC_DUPLICATE (mode, broadcast_mem);
> +      *loc = vec_dup;
> +      INSN_CODE (insn) = -1;
> +      /* Revert change if there's no corresponding pattern.  */
> +      if (recog_memoized (insn) < 0)
> +        {
> +          *loc = x;
> +          recog_memoized (insn);
> +        }

The usual way of doing this would be through
  validate_change (insn, loc, vec_dup, 0);

Also, isn't the pass also useful for TARGET_AVX and above (but in that case
only if it is a simple memory load)?  Or are avx/avx2 broadcast slower than
full vector loads?

As Jeff wrote, I wonder if when successfully replacing those pool constants
the old constant pool entries will be omitted.

Another thing I wonder about is whether more analysis shouldn't be used.
E.g.
if the constant pool entry is already emitted into .rodata anyway (e.g. some
earlier function needed it), using the broadcast will actually mean a larger
.rodata.  If {1to8} and similar broadcasts are as fast as reading all the
same elements from memory (or faster), perhaps in that case it should
broadcast from the first element of the existing full-vector constant pool
entry rather than creating a new one.
And similarly, perhaps the function should look at all constant pool entries
in the current function (not yet emitted into .rodata) and if it would
succeed for some and not for others, either use a broadcast from the first
element of the full-vector entry or not perform the replacement for the
others either.

	Jakub
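
The .rodata trade-off raised above can be sketched as a small standalone
C++ model. This is not GCC code and uses no GCC internals: PoolEntry,
Choice, new_rodata_bytes and choose are hypothetical names introduced only
for illustration, and the broadcast_as_fast flag stands in for the open
question of whether a {1toN} broadcast is at least as fast as a full
vector load.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical model of one full-vector constant pool entry whose
// elements are all identical (i.e. a broadcast candidate).
struct PoolEntry {
    std::size_t elem_bytes;          // size of one element, e.g. 4
    std::size_t nelems;              // element count, e.g. 16 for 512 bits
    bool already_emitted;            // an earlier function put it in .rodata
    bool all_uses_accept_broadcast;  // every using insn matches a broadcast
};

enum class Choice {
    KeepFullVector,          // leave the full vector loads alone
    BroadcastFromExisting,   // broadcast from element 0 of the full vector
    BroadcastFromNewScalar   // create a new scalar pool entry and broadcast
};

// Bytes this entry newly adds to .rodata under the given choice.
std::size_t new_rodata_bytes(const PoolEntry &e, Choice c) {
    assert(e.elem_bytes > 0 && e.nelems > 0);
    // The full vector is still referenced unless every use switches to
    // the new scalar entry.
    bool need_full = c != Choice::BroadcastFromNewScalar
                     || !e.all_uses_accept_broadcast;
    std::size_t bytes = 0;
    if (need_full && !e.already_emitted)
        bytes += e.elem_bytes * e.nelems;
    if (c == Choice::BroadcastFromNewScalar)
        bytes += e.elem_bytes;
    return bytes;
}

// Assumed policy: only create a new scalar entry when the full vector
// would otherwise not have to exist at all.
Choice choose(const PoolEntry &e, bool broadcast_as_fast) {
    bool full_vector_exists_anyway =
        e.already_emitted || !e.all_uses_accept_broadcast;
    if (full_vector_exists_anyway)
        return broadcast_as_fast ? Choice::BroadcastFromExisting
                                 : Choice::KeepFullVector;
    return Choice::BroadcastFromNewScalar;
}
```

Under this model, a 16 x 4-byte entry that has not been emitted and whose
uses all accept a broadcast costs 4 bytes instead of 64 with a new scalar
entry, whereas for an entry already in .rodata a new scalar adds 4 bytes
and broadcasting from the existing vector adds none.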