From: Richard Sandiford
To: Prathamesh Kulkarni
Cc: gcc Patches
Subject: Re: [aarch64] Code-gen for vector initialization involving constants
Date: Mon, 15 May 2023 19:59:04 +0100
In-Reply-To: (Prathamesh Kulkarni's message of "Mon, 15 May 2023 19:39:14 +0530")

Prathamesh Kulkarni writes:
> Hi Richard,
> After committing the interleave+zip1 patch for vector initialization,
> it seems to regress the s32 case for this patch:
>
> int32x4_t f_s32(int32_t x)
> {
>   return (int32x4_t) { x, x, x, 1 };
> }
>
> code-gen:
> f_s32:
>         movi    v30.2s, 0x1
>         fmov    s31, w0
>         dup     v0.2s, v31.s[0]
>         ins     v30.s[0], v31.s[0]
>         zip1    v0.4s, v0.4s, v30.4s
>         ret
>
> instead of expected code-gen:
> f_s32:
>         movi    v31.2s, 0x1
>         dup     v0.4s, w0
>         ins     v0.s[3], v31.s[0]
>         ret
>
> Cost for fallback sequence: 16
> Cost for interleave and zip sequence: 12
>
> For the above case, the cost for interleave+zip1 sequence is computed as:
> halves[0]:
> (set (reg:V2SI 96)
>     (vec_duplicate:V2SI (reg/v:SI 93 [ x ])))
> cost = 8
>
> halves[1]:
> (set (reg:V2SI 97)
>     (const_vector:V2SI [
>         (const_int 1 [0x1]) repeated x2
>     ]))
> (set (reg:V2SI 97)
>     (vec_merge:V2SI (vec_duplicate:V2SI (reg/v:SI 93 [ x ]))
>         (reg:V2SI 97)
>         (const_int 1 [0x1])))
> cost = 8
>
> followed by:
> (set (reg:V4SI 95)
>     (unspec:V4SI [
>         (subreg:V4SI (reg:V2SI 96) 0)
>         (subreg:V4SI (reg:V2SI 97) 0)
>     ] UNSPEC_ZIP1))
> cost = 4
>
> So the total cost becomes
> max(costs[0], costs[1]) + zip1_insn_cost
> = max(8, 8) + 4
> = 12
>
> While the fallback rtl sequence is:
> (set (reg:V4SI 95)
>     (vec_duplicate:V4SI (reg/v:SI 93 [ x ])))
> cost = 8
> (set (reg:SI 98)
>     (const_int 1 [0x1]))
> cost = 4
> (set (reg:V4SI 95)
>     (vec_merge:V4SI (vec_duplicate:V4SI (reg:SI 98))
>         (reg:V4SI 95)
>         (const_int 8 [0x8])))
> cost = 4
>
> So total cost = 8 + 4 + 4 = 16, and we choose the interleave+zip1 sequence.
>
> I think the issue is probably that for the interleave+zip1 sequence we take
> max(costs[0], costs[1]) to reflect that both halves are interleaved,
> but for the fallback seq we use seq_cost, which assumes serial execution
> of insns in the sequence.
> For above fallback sequence,
> (set (reg:V4SI 95)
>     (vec_duplicate:V4SI (reg/v:SI 93 [ x ])))
> and
> (set (reg:SI 98)
>     (const_int 1 [0x1]))
> could be executed in parallel, which would make its cost max(8, 4) + 4 = 12.

Agreed.  A good-enough substitute for this might be to ignore scalar
moves (for both alternatives) when costing for speed.

> I was wondering if we should make cost for interleave+zip1 sequence
> more conservative
> by not taking max, but summing up costs[0] + costs[1] even for speed ?
> For this case,
> that would be 8 + 8 + 4 = 20.
>
> It generates the fallback sequence for other cases (s8, s16, s64) from
> the test-case.

What does it do for the tests in the interleave+zip1 patch?  If it doesn't
make a difference there then it sounds like we don't have enough tests. :)

Summing is only conservative if the fallback sequence is somehow "safer".
But I don't think it is.  Building an N-element vector from N scalars
can be done using N instructions in the fallback case and N+1 instructions
in the interleave+zip1 case.  But the interleave+zip1 case is still
better (speedwise) for N==16.

Thanks,
Richard
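
For reference, the cost arithmetic discussed in this thread can be
reproduced with a small standalone sketch.  This is only an illustration
of the numbers quoted above, not the aarch64 backend code: every function
name here is hypothetical, and the per-insn costs (8, 4, 4 and 8, 8, 4)
are copied from the quoted message rather than obtained from GCC.

```c
#include <stddef.h>

/* Hypothetical helpers that mirror the arithmetic in the thread.
   None of these are GCC functions.  */

static inline int max_int(int a, int b)
{
  return a > b ? a : b;
}

/* Serial costing: add up every insn in the sequence.  The quoted
   message describes seq_cost as assuming this model.  */
static inline int serial_cost(const int *costs, size_t n)
{
  int total = 0;
  for (size_t i = 0; i < n; i++)
    total += costs[i];
  return total;
}

/* Serial costing that skips insns flagged as scalar moves, as a rough
   model of the "ignore scalar moves" suggestion.  The flag array is an
   assumption of this sketch.  */
static inline int serial_cost_ignoring_scalar_moves(const int *costs,
                                                    const int *is_scalar_move,
                                                    size_t n)
{
  int total = 0;
  for (size_t i = 0; i < n; i++)
    if (!is_scalar_move[i])
      total += costs[i];
  return total;
}

/* Interleave+zip1 costed for speed: the two halves are assumed to
   overlap, so only the more expensive half counts.  */
static inline int interleave_cost_max(int half0, int half1, int zip1)
{
  return max_int(half0, half1) + zip1;
}

/* The "more conservative" variant raised in the quoted message:
   sum both halves instead of taking the maximum.  */
static inline int interleave_cost_sum(int half0, int half1, int zip1)
{
  return half0 + half1 + zip1;
}
```

With the f_s32 numbers, serial_cost on {8, 4, 4} gives 16 and
interleave_cost_max(8, 8, 4) gives 12, which is why interleave+zip1 is
currently chosen.  Skipping the scalar move (the middle insn, cost 4)
brings the fallback down to 12, and summing the halves raises
interleave+zip1 to 20.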